This document is compiled from an Rmarkdown file which contains all code necessary to reproduce the analysis for the accompanying manuscript “Copy-number signatures and mutational processes in ovarian carcinoma”. Details on how to compile the document can be found in the repository README: https://bitbucket.org/britroc/cnsignatures. Much of the code for signature analysis can be found in the main_functions.R and helper_functions.R files in the base directory of the repositoy. Each section below describes a different component of the analysis and all numbers and figures are generated directly from the underlying data on compilation.

Data preprocessing

The analysis in this document starts from preprocessed copy-number profiles for all datasets due to restrictions on the underlying raw data. Details on how to reproduce the results from raw data can be found in the main manuscript.

BriTROC-1 sample filtering and curation

A total of 300 samples were shallow whole-genome and tagged amplicon sequenced. Raw data were processed as outlined in the main manuscript. After inspection of the TP53 mutation status and relative copy-number profiles, 47 were excluded from downstream analysis for the following reasons:

## 
##  low purity mislabelled   not HGSOC     no TP53 
##          24           7           3          13

Samples were then processed using ABCEL (to be released) to generate absolute copy-number profiles. All samples were manually inspected and the copy-number fit adjusted if necessary (based on evidence from other samples from the same patient).

Following absolute copy-number fitting, the samples were rated using a 1-3 star system.

  • 1 star samples showed a noisy copy-number profile and were considered likely to have incorrect segments and missing calls. These were excluded from further analysis.
  • 2 star samples showed a reasonable copy-number profile with only a small number of miscalled segments. These samples were used (with caution) for some subsequent analyses.
  • 3 star samples showed a high-quality copy-number profile that was used in all downstream analyses.

Star rating per sample:

## 
##   1   2   3 
##  54  52 147

Maximum star rating per case:

star_rating n
1 15
2 26
3 91

Assessing absolute copy-number fits

To test the performance of our absolute copy-number fiting from sWGS, we compared ploidy and purity estimates across a subset of samples with matched deepWGS. The estimates for deepWGS samples were extracted from Battenberg plots for those samples.

# Read in ploidy and purity estimates obtained from Battenberg plots.

deep_metrics <- read.csv("data/britroc_deepWGS_ploidy_purity_49.csv", header=T, as.is = T)
deep_metrics$purity <- deep_metrics$purity/100
deep_metrics <- deep_metrics %>%
  mutate(name= sub("[0-9]+_tumor_","",x=wgs_ID)) %>%
  dplyr::select(name,ploidy, purity) %>%
  tidyr::gather("metric", "deep", -name)

# Get ploidy for sWGS samples

CN <- readRDS("data/britroc_absolute_copynumber.rds")
Biobase::pData(CN)$ploidy <- getPloidy(CN) %>%
  .$out

shallow_metrics <- Biobase::pData(CN) %>%
  dplyr::select(name, ploidy, purity) %>%
  filter(name %in% deep_metrics$name) %>%
  tidyr::gather("metric", "shallow", -name)


# Annotate sWGS samples with star_rating
samp_annotation_all <- read.csv("data/britroc_sample_data.csv", as.is=T)
samp_annot <- samp_annotation_all %>% 
  filter(IM.JBLAB_ID %in% shallow_metrics$name & Failed !="Y") %>%
  dplyr::select(IM.JBLAB_ID , star_rating)

# Combine estimates from deep and shallow WGS methods and retain 3-star samples
metrics <- inner_join(deep_metrics, shallow_metrics, by = c("name", "metric")) %>%
  inner_join(samp_annot,c("name" = "IM.JBLAB_ID" ) ) %>%
  filter(star_rating ==3)

# Perform correlation test of ploidy, purity in Britroc samples with dWGS and sWGS data
correlations <- metrics %>% 
  group_by(metric) %>%
  do(broom::tidy(cor.test(.$deep,.$shallow)))
# Create dataframes for annotation and generate plot for 3-star samples.

correlations$metric <- factor(correlations$metric, levels=c("ploidy", "purity"))

annot_text <- correlations %>% 
  ungroup() %>%
  dplyr::select(metric, estimate, p.value) %>%
  mutate(x=c(2,0.4),y.cor=c(4.5,0.9), y.pval= c(4.375, 0.87)) %>%
  group_by(metric) %>%
  mutate(label.cor=paste0("R^2== ",round(estimate,2)), label.pval = paste0("P== ", format(signif(p.value,1),digits=1, scientific = T)) )


cor_plot <- metrics %>%
  ggplot(aes(x=deep, y=shallow)) +
  geom_point()  + 
  geom_smooth(method = "lm") + 
  facet_wrap(~metric, scales = "free") +
  labs(x= "Deep WGS", y= "sWGS") +
  theme_bw()+
  theme(axis.text=element_text(size=5),axis.title=element_text(size=7),
  strip.text.x = element_text(size = 7), panel.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())
  

cor_plot + 
  geom_text(data=annot_text, aes(label=label.cor, x=x,y=y.cor), parse = T,inherit.aes=FALSE, size=2) +
  geom_text(data=annot_text, aes(label=label.pval, x=x,y=y.pval), parse = T,inherit.aes=FALSE, size=2)

These plot show high concordance between ploidy and purity estimated using sWGS and deep WGS. Only two samples showed a significantly different ploidy estimate between sWGS and deepWGS. On closer inspection of the deepWGS results both samples were deemed whole-genome duplication uncertain, meaning both ploidy solutions could have been likely.

Extracting copy-number features

For signature analysis we opted to use cases with 3 star copy-number profiles (total=91).

In order to perform copy-number signature identification we summarised each copy-number profile using a number of different feature distributions:

  • Segment size
  • Breakpoint number (per ten megabases)
  • Change-point copy-number
  • Segment copy-number
  • Breakpoint number (per chromosome arm)
  • Length of segments with oscillating copy-number
#read in absolute CN post ABCEL processing
all_CN<-readRDS("data/britroc_absolute_copynumber.rds")
all_CN<-all_CN[,colnames(all_CN)%in%samp_annotation[!samp_annotation$Failed=="Y","IM.JBLAB_ID"]]

#extract 3 star CN from each case
ids<-result %>% filter(star_rating==3)
ids<-ids$IM.JBLAB_ID
hq_CN<-all_CN[,colnames(all_CN)%in%ids]
ids<-result %>% filter(star_rating==2)
ids<-ids$IM.JBLAB_ID
lq_CN<-all_CN[,colnames(all_CN)%in%ids]

#estract copy-number features
CN_features<-extractCopynumberFeatures(hq_CN,cores=num_cores)

Following copy-number feature extraction we applied mixture modelling to breakdown each feature distribution into mixtures of Gaussian or mixtures of Poisson distributions using the flexmix package.

seed=77777
min_prior=0.001
model_selection="BIC"
nrep=1
niter=1000

dat<-as.numeric(CN_features[["segsize"]][,2])
segsize_mm<-fitComponent(dat,seed=seed,model_selection=model_selection,
                         min_prior=min_prior,niter=niter,nrep=nrep,min_comp=10,max_comp=10)

dat<-as.numeric(CN_features[["bp10MB"]][,2])
bp10MB_mm<-fitComponent(dat,dist="pois",seed=seed,model_selection=model_selection,
                        min_prior=min_prior,niter=niter,nrep=nrep,min_comp=3,max_comp=3)

dat<-as.numeric(CN_features[["osCN"]][,2])
osCN_mm<-fitComponent(dat,dist="pois",seed=seed,model_selection=model_selection,
                      min_prior=min_prior,niter=niter,nrep=nrep,min_comp=3,max_comp=3)

dat<-as.numeric(CN_features[["bpchrarm"]][,2])
bpchrarm_mm<-fitComponent(dat,dist="pois",seed=seed,model_selection=model_selection,
                          min_prior=min_prior,niter=niter,nrep=3,min_comp=2,max_comp=5)

dat<-as.numeric(CN_features[["changepoint"]][,2])
changepoint_mm<-fitComponent(dat,seed=seed,model_selection=model_selection,
                             min_prior=min_prior,niter=niter,nrep=nrep,min_comp=7,max_comp=7)

dat<-as.numeric(CN_features[["copynumber"]][,2])
copynumber_mm<-fitComponent(dat,seed=seed,model_selection=model_selection,
                            nrep=nrep,min_comp=2,max_comp=10,min_prior=0.005,niter=2000)

CN_components<-list(segsize=segsize_mm,bp10MB=bp10MB_mm,osCN=osCN_mm,changepoint=changepoint_mm,copynumber=copynumber_mm,bpchrarm=bpchrarm_mm)

Once the components were identified we then generated a sample-by-component matrix representing the sum of posterior probabilities of each copy-number event being assigned to each component.

britroc_sample_component_matrix<-generateSampleByComponentMatrix(CN_features,CN_components,cores=1,subcores=num_cores)
NMF::aheatmap(britroc_sample_component_matrix,fontsize = 7,Rowv=FALSE,Colv=FALSE,legend = T,breaks=c(seq(0,199,2),500),main="Component x Sample matrix")

Signature identification

We used the NMF package to factorise the sample-by-component matrix into a signature-by-sample matrix and component by signature matrix.

Selecting the optimum number of signatures

Previous studies suggest that there should be a minimum of roughly 4 detectable signatures representing mutagenic processes which give rise to distinct patterns of copy-number change: breakage-fusion-bridge, duplicator phenotype, chromothripsis, and amplifier phenotype. Our ability to detect these depends on whether each pattern is sufficiently represented in our data. As NMF demands the upper bound on the number of signatures be << than both the sample and component number, we chose a signature search interval of [3,12]. We ran the matrix factorisation 1000 times for each number of signatures, each with a different random seed, and compared the cophentic, dispersion, silhouette, and sparseness scores for the signature-feature matrix (basis), patient-signature matrix (coefficients) and a consensus matrix of patient-by-patient across the 1000 runs. In addition we performed 1000 random shuffles of the input matrix to get a null estimate of each of the scores.

nmfalg<-"brunet"
seed<-77777
estim.r <- NMF::nmfEstimateRank(t(britroc_sample_component_matrix), 3:12,seed = seed,nrun=1000,
                            verbose=F,method=nmfalg,.opt = paste0("p",num_cores))
V.random <- randomize(t(britroc_sample_component_matrix))
estim.r.random <- NMF::nmfEstimateRank(V.random, 3:12, seed =seed,nrun=1000,
                                   verbose=F,method=nmfalg,.opt = paste0("p",num_cores))
p<-plot(estim.r,estim.r.random, 
      what = c("cophenetic", "dispersion","sparseness", "silhouette"),xname="Observed",yname="Randomised",main="")+
    theme(axis.text=element_text(size=5),axis.title=element_text(size=5),
                    strip.text.x = element_text(size = 5),
                    strip.text.y = element_text(size = 5),
                    legend.text = element_text(size = 5),
                     legend.title = element_text(size = 7))
g<-ggplotGrob(p)
g[["grobs"]][[2]]$children[[4]]$size[[1]]<-0.5
g[["grobs"]][[3]]$children[[4]]$size[[1]]<-0.5
g[["grobs"]][[4]]$children[[4]]$size[[1]]<-0.5
g[["grobs"]][[5]]$children[[4]]$size[[1]]<-0.5
grid::grid.newpage()
grid::grid.draw(g)

The plots above suggest that the optimal number of signatures is 7. A signature number higher than this would force the sparseness in the signature by feature matrix (basis) to be greater than that which could be obtained by randomly shuffling the input matrix.

Results of NMF

Derived matrices as heatmaps:

nsig<-7
sigs<-NMF::nmf(t(britroc_sample_component_matrix),nsig,seed=seed,nrun=1000,method=nmfalg,.opt = paste0("p",num_cores))
coefmap(sigs,Colv="consensus",tracks=c("basis:"), main="Patient x Signature matrix")

basismap(sigs,Rowv=NA,main="Signature x Component matrix")

Signature identification validation

To test the robustness of the signature identification approach we applied the same procedure to two independent datasets: deep whole-genome sequencing (approximately 40x) of high-grade serous ovarain cancer samples processed as part of the Pan-Cancer Analysis of Whole Genomes project (PCAWG); and SNParray profiling of HGSOC cases as part of TCGA.

Signature identification in 112 PCAWG cases

We took the ABSOLUTE copy-number profiles of 112 cases and subjected them to the same feature extraction and NMF procedure as above:

pcawg_CN_features<-readRDS("data/pcawg_CN_features.rds")
pcawg_sample_component_matrix<-generateSampleByComponentMatrix(pcawg_CN_features,CN_components)

NMF::aheatmap(pcawg_sample_component_matrix,Rowv=NULL, main="Component x Sample matrix")

pcawg_ids<-rownames(pcawg_sample_component_matrix)

pcawg_sigs<-NMF::nmf(t(pcawg_sample_component_matrix),nsig,seed=seed,nrun=1000,method=nmfalg,.opt = "p16")
coefmap(pcawg_sigs,Colv="consensus",tracks=c("consensus:"), main="Sample x Signature matrix")#annCol=annCol,

basismap(pcawg_sigs,Rowv=NA,main="Signature x Component matrix")

Signature identification in 415 TCGA samples

We used ABSOLLUTE copy-number calls for 415 samples and applied the same signature identification procedure as above:

tcga_CN_features<-readRDS("data/tcga_CN_features.rds")
tcga_sample_component_matrix<-generateSampleByComponentMatrix(tcga_CN_features,CN_components)
NMF::aheatmap(tcga_sample_component_matrix,Rowv=NULL, main="Sample x Component matrix")

tcga_ids<-rownames(tcga_sample_component_matrix)

tcga_sigs<-NMF::nmf(t(tcga_sample_component_matrix),nsig,seed=seed,nrun=1000,method=nmfalg,.opt = "p16")
coefmap(tcga_sigs,Colv="consensus",tracks=c("consensus:"), main="Sample x Signature matrix")#annCol=annCol,

basismap(tcga_sigs,Rowv=NA,main="Signature x Component matrix")

Comparison of signatures across cohorts

We calculated the Spearman rank correlation between feature vectors for each signature to assess how similar the signatures were across the BRITROC, PCAWG and TCGA cohorts:

reord_components<-c(11:13,24:31,17:23,32:36,14:16,1:10)

#britroc feat_sig matrix
feat_sig_mat<-basis(sigs)
reord_britroc<-as.integer(c(2,6,5,4,7,3,1))
names(reord_britroc)<-paste0("s",1:7)
feat_sig_mat<-feat_sig_mat[,reord_britroc]
colnames(feat_sig_mat)<-paste0("s",1:nsig)
sig_feat_mat<-t(feat_sig_mat)

#pcawg feat_sig matrix
feat_sig_mat_pcawg<-basis(pcawg_sigs)[,]
sig_feat_mat_pcawg<-t(feat_sig_mat_pcawg)
colnames(feat_sig_mat_pcawg)<-paste0("s",1:nsig)

#tcga feat_sig matrix
feat_sig_mat_tcga<-basis(tcga_sigs)[,]
sig_feat_mat_tcga<-t(feat_sig_mat_tcga)
colnames(feat_sig_mat_tcga)<-paste0("s",1:nsig)


#determine matching signatures and their correlation
reord_pcawg<-apply(feat_sig_mat,2,function(x){which.max(apply(feat_sig_mat_pcawg,2,cor,x,method="s"))})
sigcor_pcawg<-apply(feat_sig_mat,2,function(x){max(apply(feat_sig_mat_pcawg,2,cor,x,method="s"))})

reord_tcga<-apply(feat_sig_mat,2,function(x){which.max(apply(feat_sig_mat_tcga,2,cor,x,method="s"))})
sigcor_tcga<-apply(feat_sig_mat,2,function(x){max(apply(feat_sig_mat_tcga,2,cor,x,method="s"))})

#plot the feat_sig matrices side by side
par(mfrow=c(1,3))
basismap(sigs,Rowv=NA,Colv=reord_britroc,main="BritROC",tracks=NA)
basismap(pcawg_sigs,Rowv=NA,Colv=reord_pcawg,main="PCAWG",tracks=NA)
basismap(tcga_sigs,Rowv=NA,Colv=reord_tcga,main="TCGA",tracks=NA)

#reorder feat_sig matrix
sig_feat_mat_pcawg<-data.frame(sig_feat_mat_pcawg[reord_pcawg,],stringsAsFactors = F)
rownames(sig_feat_mat_pcawg)<-paste0("s",1:nsig)
feat_sig_mat_pcawg<-t(sig_feat_mat_pcawg)

sig_feat_mat_tcga<-data.frame(sig_feat_mat_tcga[reord_tcga,],stringsAsFactors = F)
rownames(sig_feat_mat_tcga)<-paste0("s",1:nsig)
feat_sig_mat_tcga<-t(sig_feat_mat_tcga)

b_vs_p<-c()
for(i in 1:nsig)
{
  b_vs_p<-c(b_vs_p,cor.test(feat_sig_mat[,i],feat_sig_mat_pcawg[,i])$p.value)
}  
  
b_vs_t<-c()
for(i in 1:nsig)
{
  b_vs_t<-c(b_vs_t,cor.test(feat_sig_mat[,i],feat_sig_mat_tcga[,i])$p.value)
}

b_vs_p<-p.adjust(b_vs_p,method="BH")
b_vs_t<-p.adjust(b_vs_t,method="BH")

sig_comp_out<-rbind(
  format(round(sigcor_pcawg,2),scientific=F),
  signif(b_vs_p,1),
  format(round(sigcor_tcga,2),scientific=F),
  signif(b_vs_t,1))

rownames(sig_comp_out)<-c("ICGC correlation","ICGC p-value","TCGA correlation","TCGA p-value")
colnames(sig_comp_out)<-1:nsig

knitr::kable(sig_comp_out)
1 2 3 4 5 6 7
ICGC correlation 0.78 0.91 0.95 0.88 0.93 0.73 0.84
ICGC p-value 1e-43 6e-15 3e-21 2e-15 5e-26 9e-05 2e-09
TCGA correlation 0.73 0.84 0.89 0.93 0.71 0.56 0.92
TCGA p-value 5e-41 8e-14 4e-15 3e-11 1e-08 3e-08 1e-21

The heatmap and signature correlations above show that the signatures derived from the deep whole-genome sequencing are similar to those derived using sWGS.

Signature exposure matrices were normalised to sum to one and exposures less than 0.01 were considered 0.

sig_thresh<-0.01

sig_pat_mat_hq<-scoef(sigs)
sig_pat_mat_hq<-sig_pat_mat_hq[reord_britroc,]
rownames(sig_pat_mat_hq)<-paste0("s",1:nsig)
sig_pat_mat_hq<-normaliseMatrix(sig_pat_mat_hq,sig_thresh)

sig_pat_mat_pcawg<-scoef(pcawg_sigs)
rownames(sig_pat_mat_pcawg)<-paste0("s",1:nsig)
sig_pat_mat_pcawg<-sig_pat_mat_pcawg[reord_pcawg,]
rownames(sig_pat_mat_pcawg)<-paste0("s",1:nsig)
sig_pat_mat_pcawg<-normaliseMatrix(sig_pat_mat_pcawg,sig_thresh)

sig_pat_mat_tcga<-scoef(tcga_sigs)
rownames(sig_pat_mat_tcga)<-paste0("s",1:nsig)
sig_pat_mat_tcga<-sig_pat_mat_tcga[reord_tcga,]
rownames(sig_pat_mat_tcga)<-paste0("s",1:nsig)
sig_pat_mat_tcga<-normaliseMatrix(sig_pat_mat_tcga,sig_thresh)

Copy-number signature assignment

We assigned all 2-star copy-number profile samples signature proportions using the LCD function from the YAPSA package.

#extract features for lower quality samples
lowquality_britroc_CN_features<-extractCopynumberFeatures(lq_CN)
lowquality_britroc_sample_component_matrix<-generateSampleByComponentMatrix(lowquality_britroc_CN_features,CN_components)
britroc_lq_ids<-rownames(lowquality_britroc_sample_component_matrix)

#assign signatures
sig_pat_mat_lq<-YAPSA::LCD(t(lowquality_britroc_sample_component_matrix),feat_sig_mat)
rownames(sig_pat_mat_lq)<-paste0("s",1:nsig)
sig_pat_mat_lq<-normaliseMatrix(sig_pat_mat_lq,sig_thresh)

#assign signatures to remaining 2 and 3 star samples (for cases with multiple samples)
remain_samp<-samp_annotation[(!samp_annotation$IM.JBLAB_ID%in%colnames(cbind(sig_pat_mat_hq,sig_pat_mat_lq)))&(!samp_annotation$star_rating==1),]
remain_CN<-all_CN[,colnames(all_CN)%in%remain_samp$IM.JBLAB_ID]

remain_britroc_CN_features<-extractCopynumberFeatures(remain_CN)
remain_britroc_sample_component_matrix<-generateSampleByComponentMatrix(remain_britroc_CN_features,CN_components)

britroc_remain_ids<-rownames(remain_britroc_sample_component_matrix)

sig_pat_mat_remain<-YAPSA::LCD(t(remain_britroc_sample_component_matrix),feat_sig_mat)
rownames(sig_pat_mat_remain)<-paste0("s",1:nsig)
sig_pat_mat_remain<-normaliseMatrix(sig_pat_mat_remain,sig_thresh)

sig_pat_mat_britroc<-cbind(sig_pat_mat_hq,sig_pat_mat_lq)
sig_pat_mat_britroc_all<-cbind(sig_pat_mat_britroc,sig_pat_mat_remain)

Visualisations

Copy-number signature component weights

To adopt the same visualisation that has been used for SNV signatures, we generated a histogram for each signature, containing the relative weighting of each of the components, colour coded by the feature distribution they come from.

Feature distributions for BRITROC cohort

Weighted density across all BritROC samples for each signature

Mixture models

Shaded mixture models for signature 1

Component means and standard deviations

mean sd component
segsize1 4.268619e+05 1.869249e+05 segsize1
segsize2 1.081858e+06 4.073021e+05 segsize2
segsize3 2.233030e+06 7.490920e+05 segsize3
segsize4 4.303321e+06 1.115832e+06 segsize4
segsize5 7.304340e+06 1.644307e+06 segsize5
segsize6 1.047933e+07 2.972414e+06 segsize6
segsize7 1.641912e+07 5.226151e+06 segsize7
segsize8 2.950832e+07 9.703791e+06 segsize8
segsize9 5.863890e+07 2.049900e+07 segsize9
segsize10 1.183110e+08 4.521059e+07 segsize10
copynumber1 9.979984e-01 1.028012e-01 copynumber1
copynumber2 1.981355e+00 1.214921e-01 copynumber2
copynumber3 2.561682e+00 1.002305e+00 copynumber3
copynumber4 2.991089e+00 1.440896e-01 copynumber4
copynumber5 3.970519e+00 1.714569e-01 copynumber5
copynumber6 4.271647e+00 1.584293e+00 copynumber6
copynumber7 8.392609e+00 3.501494e+00 copynumber7
copynumber8 3.086723e+01 2.315811e+01 copynumber8
changepoint1 4.945265e-01 1.564534e-01 changepoint1
changepoint2 9.683467e-01 1.562940e-01 changepoint2
changepoint3 1.178598e+00 2.266991e-01 changepoint3
changepoint4 1.822407e+00 3.990779e-01 changepoint4
changepoint5 3.006858e+00 1.039581e+00 changepoint5
changepoint6 7.314942e+00 3.459220e+00 changepoint6
changepoint7 2.873467e+01 2.205516e+01 changepoint7
bpchrarm1 6.154320e-02 NA bpchrarm1
bpchrarm2 2.622567e+00 NA bpchrarm2
bpchrarm3 7.777202e+00 NA bpchrarm3
bpchrarm4 1.754649e+01 NA bpchrarm4
bpchrarm5 3.353068e+01 NA bpchrarm5
bp10MB1 6.470000e-05 NA bp10MB1
bp10MB2 1.255291e+00 NA bp10MB2
bp10MB3 4.074583e+00 NA bp10MB3
osCN1 3.394844e-01 NA osCN1
osCN2 2.627145e+00 NA osCN2
osCN3 9.587145e+00 NA osCN3

Exposures across cohorts

b<-reshape2::melt(sig_pat_mat_britroc)
b$cohort<-"BriTROC"
p<-reshape2::melt(sig_pat_mat_pcawg)
p$cohort<-"PCAWG"
t<-reshape2::melt(sig_pat_mat_tcga)
t$cohort<-"TCGA"

pdat<-rbind(b,p,t)
colnames(pdat)<-c("Signature","Patient","Exposure","Cohort")
pdat$Signature<-plyr::revalue(pdat$Signature,
                            c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))

signif<-ggpubr::compare_means(Exposure ~ Cohort,data=pdat,group.by = c("Signature"),
              method="wilcox.test",p.adjust.method="BH")
signif<-signif %>% filter(p.signif!="ns") %>% group_by(Signature) %>% filter(p.format==min(p.format))
signif$Exposure<-1

knitr::kable(signif)
Signature .y. group1 group2 p p.adj p.format p.signif method Exposure
2 Exposure BriTROC TCGA 0.0051493 0.0180224 0.0051 ** Wilcoxon 1
3 Exposure BriTROC TCGA 0.0033635 0.0141267 0.0034 ** Wilcoxon 1
4 Exposure BriTROC PCAWG 0.0000003 0.0000028 2.7e-07 **** Wilcoxon 1
5 Exposure BriTROC PCAWG 0.0005020 0.0035137 0.0005 *** Wilcoxon 1
ggplot(pdat,aes(x=Signature,y=Exposure,fill=Cohort))+geom_boxplot(alpha=0.3)+
  my_theme+scale_fill_manual(values = cbPalette)+coord_cartesian(ylim=c(0,1))+
  annotate("text",x = signif$Signature,y=signif$Exposure,label=signif$p.signif)

Survival analysis

Data preprocessing

This section describes how the survival intervals in the Britroc-1 cohort were calculated from raw data files generated from the Britroc clinical trial database. The following raw data files required for the calculation can be obtained by writing to the Data Access Committee:
1. reg.csv - Registration Information
2. bas.csv - Baseline Information
3. flchm.csv - First Line Chemotherapy Information
4. flrsp.csv - First Line Response Information
5. dnf.csv - Death Notification Information
6. biop.csv - Image Guided Biopsy Information

Records in the flchm.csv file were filtered to retain only patients that had received platinum-based therapy as first line treatment. In instances where only month and year were available for certain time points the date was assumed to be the 1st of the month to enable calculations.

# run this code if processing from raw data files placed in the 'restricted_data/' folder
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(zoo))

filepath <-"restricted_data/"
filenames <- c("flchm", "flrsp", "bas", "dnf", "reg", "biop")

for(fname in filenames){
  assign(fname,read.csv(paste0(filepath,fname,".csv"),na.strings=c(""," ","NA"), as.is = T))
}

# subset flchm to platinum-based therapy only

flchm <- flchm %>%
  filter(grepl("PLAT", DRUG1)) %>%
  select(-DRUG1)

# use full_join to keep missing information across datasets, combine dataframes by TRIALNO and retain only columns necessary for subsequent analysis

tokeep <- c("STUDYNO", "TRIALNO", "STAGE", "STGABC" , "DIAGDATE", "DIAGDATE_MONTH", "DIAGDATE_YEAR", "DTPROG1", "DTPROG1_MONTH","DTPROG1_YEAR", "DOD", "DTALIVE", "REGDATE", "AGE", "BIOPDT")

merged <- full_join(bas,flchm ,by="TRIALNO") %>% 
  full_join(flrsp,by="TRIALNO")  %>%
  full_join(dnf,by="TRIALNO") %>%
  full_join(reg,by="TRIALNO") %>% 
  select(one_of(tokeep)) %>%
  distinct

# format columns containing date information into date type
datecols <- c("DIAGDATE", "DTPROG1", "DOD", "DTALIVE", "REGDATE")
merged[datecols] = lapply(merged[datecols], as.Date, "%m/%d/%Y")

# fill in missing DIAGDATE and DTPROG1 values as 1st of the month where only month and year info is available
idx <- which(is.na(merged$DIAGDATE) & !is.na(merged$DIAGDATE_MONTH) & !is.na(merged$DIAGDATE_YEAR))
dates <- paste0(merged$DIAGDATE_MONTH[idx],merged$DIAGDATE_YEAR[idx])
merged$DIAGDATE[idx] <- as.Date(as.yearmon(dates, format = "%b%Y"))
idx.dtprog1 <- which(is.na(merged$DTPROG1) & !is.na(merged$DTPROG1_MONTH)& !is.na(merged$DTPROG1_YEAR))
dates.dtprog1 <- paste0(merged$DTPROG1_MONTH[idx.dtprog1],merged$DTPROG1_YEAR[idx.dtprog1])
merged$DTPROG1[idx.dtprog1] <- as.Date(as.yearmon(dates.dtprog1, format = "%b%Y"))

# add last documented clinical assessment (DTLAST) date and survival status (STATUS) columns as of 1 December 2016
# DTLAST: DTALIVE or DOD
# STATUS: a death event was recorded as 1 and survival as 0  
merged$DTLAST <- merged$DTALIVE
survidx <- which(!is.na(merged$DOD))
merged$DTLAST[survidx] <- merged$DOD[survidx]
merged$STATUS <- 1  
merged$STATUS[survidx] <- 0

# overall survival in BriTROC-1 patients was calculated from the date of enrolment to the date of death or the last documented clinical assessment, with data cutoff at 1 December 2016.
merged$PFS <- difftime(merged$DTPROG1, merged$DIAGDATE)
merged$OS <- difftime(merged$DTLAST, merged$DIAGDATE)

merged$INT_START <- difftime(merged$REGDATE,merged$DIAGDATE)
merged$INT_END <- difftime(merged$DTLAST,merged$DIAGDATE)

# remove erroneous rows with negative end dates
merged <- filter(merged,INT_START < INT_END)

out <- merged %>%
  select(TRIALNO, AGE, STATUS, PFS, OS, INT_START, INT_END)
write.table(out, file = "data/britroc_survival_intervals.tsv", sep = "\t", quote = F, row.names = F)

Given that the BritROC study only enrolled patients with relapsed disease, it was necessary to left truncate the overall survival time. In addition, cases where the patient was not deceased were right censored. We combined the BritROC, PCAWG and TCGA cases and fit a Cox proportinal hazards model. We stratifed by study (BritROC, OV-US, OV-AU and TCGA).

Compiling training and test cohorts

library(survival)
set.seed(seed)
samp_num<-0.7 #fraction of the data considered training

tcga_clin<-read.table("data/tcga_sample_info.tsv",stringsAsFactors = F,header=T,sep="\t")
pcawg_clin<-read.table("data/pcawg_sample_info.tsv",sep="\t",header=T,stringsAsFactors = F)
pcawg_cohort<-read.table("data/pcawg_cohort_info.tsv",header=T,sep="\t",stringsAsFactors = F)

#britroc survival data
pat_info<-samp_annotation[,c("Britroc_No","IM.JBLAB_ID","star_rating")]
pat_info<-samp_annotation[grepl("IM",samp_annotation$IM.JBLAB_ID),]
pat_info<-pat_info[order(pat_info$Britroc_No,pat_info$star_rating,decreasing = T),]
pat_info<-pat_info[!duplicated(pat_info$Britroc_No),]
pat_info<-pat_info[!pat_info$IM.JBLAB_ID%in%c("IM_91","IM_70"),]#remove misclassified samples
surv_dat<-read.table("data/britroc_survival_intervals.tsv",sep="\t",header=T,stringsAsFactors = F)
surv_dat<-surv_dat[!duplicated(surv_dat$TRIALNO),]
britroc_surv_dat<-merge(surv_dat,pat_info,by.x=1,by.y=1)
rownames(britroc_surv_dat)<-britroc_surv_dat$IM.JBLAB_ID
britroc_surv_dat<-britroc_surv_dat[colnames(sig_pat_mat_britroc_all)[colnames(sig_pat_mat_britroc_all)%in%rownames(britroc_surv_dat)],]
britroc_surv_dat<-britroc_surv_dat[!(is.na(britroc_surv_dat$INT_START)|is.na(britroc_surv_dat$INT_END)),]
britroc_surv_dat<-britroc_surv_dat[!(britroc_surv_dat$INT_START>britroc_surv_dat$INT_END),]
britroc_train<-rownames(britroc_surv_dat)
britroc_test<-rownames(britroc_surv_dat)[!rownames(britroc_surv_dat)%in%britroc_train]

#pcawg survival data
pdat<-pcawg_cohort[pcawg_cohort$tumor_wgs_aliquot_id%in%colnames(sig_pat_mat_pcawg),c("tumor_wgs_aliquot_id","icgc_donor_id","dcc_project_code")]
pdat<-merge(pdat,pcawg_clin[,c("icgc_donor_id","donor_vital_status","donor_survival_time","donor_interval_of_last_followup","donor_age_at_diagnosis")],by.x=2,by.y=1)
pdat<-pdat[!duplicated(pdat$tumor_wgs_aliquot_id),]
rownames(pdat)<-pdat$tumor_wgs_aliquot_id
pdat<-pdat[colnames(sig_pat_mat_pcawg),]
pcawg_surv_dat<-pdat
pcawg_os<-pcawg_surv_dat$donor_survival_time
pcawg_os[is.na(pcawg_os)]<-pcawg_surv_dat$donor_interval_of_last_followup[is.na(pcawg_os)]
pcawg_surv_dat$os<-pcawg_os
pcawg_event<-pcawg_surv_dat$donor_vital_status
pcawg_event<-plyr::revalue(pcawg_event,c("deceased"=1,"alive"=0))
pcawg_surv_dat$event<-pcawg_event
pcawg_train<-sample(rownames(pcawg_surv_dat),round(samp_num*nrow(pcawg_surv_dat)))
pcawg_test<-rownames(pcawg_surv_dat)[!rownames(pcawg_surv_dat)%in%pcawg_train]

#tcga survival data
tcga_surv_dat<-tcga_clin
tcga_surv_dat<-tcga_surv_dat[!is.na(tcga_surv_dat$age_at_initial_pathologic_diagnosis),]
tcga_surv_dat<-tcga_surv_dat[tcga_surv_dat$bcr_aliquot_barcode%in%colnames(sig_pat_mat_tcga),]
tcga_surv_dat<-tcga_surv_dat[!is.na(tcga_surv_dat$days_to_last_followup),]
tcga_survival_sigmat<-sig_pat_mat_tcga
tcga_os<-tcga_surv_dat$days_to_death
tcga_status<-!is.na(tcga_os)
tcga_surv_dat$status<-tcga_status
tcga_os[is.na(tcga_os)]<-tcga_surv_dat$days_to_last_followup[is.na(tcga_os)]
tcga_surv_dat$os<-tcga_os
tcga_train<-sample(rownames(tcga_surv_dat),round(samp_num*nrow(tcga_surv_dat)))
tcga_test<-rownames(tcga_surv_dat)[!rownames(tcga_surv_dat)%in%tcga_train]


combined_sig_pat_mat<-cbind(sig_pat_mat_britroc_all[,rownames(britroc_surv_dat)],
                            sig_pat_mat_pcawg[,rownames(pcawg_surv_dat)],
                            sig_pat_mat_tcga[,tcga_surv_dat$bcr_aliquot_barcode])

#threshold low exposure values to 2%
combined_sig_pat_mat[combined_sig_pat_mat<=0.02]<-0.02

combined_os_survival<-survival::Surv(time= c(britroc_surv_dat$INT_START,
                                   rep(1,length(pcawg_os)),
                                   rep(1,length(tcga_os))),
                           time2=c(britroc_surv_dat$INT_END,
                                   pcawg_surv_dat$os,
                                   tcga_surv_dat$os),
                           event=c(!britroc_surv_dat$STATUS,
                                   as.numeric(pcawg_surv_dat$event)==1,
                                   tcga_surv_dat$status))
survival_ids<-c(rownames(britroc_surv_dat),rownames(pcawg_surv_dat),rownames(tcga_surv_dat))

sig_data<-data.frame(t(combined_sig_pat_mat),
                     study=c(rep("britroc",nrow(britroc_surv_dat)),
                             pcawg_surv_dat$dcc_project_code,rep("TCGA",nrow(tcga_surv_dat))),
                     age=c(britroc_surv_dat$AGE,
                           pcawg_surv_dat$donor_age_at_diagnosis,
                           tcga_surv_dat$age_at_initial_pathologic_diagnosis),
                     stringsAsFactors = F)
#stratify by age
sig_data$age.cat<-car::recode(as.numeric(sig_data$age), "lo:39=1; 40:44=2; 45:49=3; 50:54=4; 55:59=5; 60:64=6; 65:69=7; 70:74=8; 75:79=9; 80:hi=10")

As the signatures sum to 1, in order to perform survival analysis the exposures had to be normalised using the signature with the lowest variance:

#determine signature with lowest variance
which.min(apply(combined_sig_pat_mat,1,var))
## s5 
##  5
#normalise signature exposures
sig_data_unorm<-sig_data
sig_data[,1:nsig]<-apply(sig_data[,1:nsig],2,function(x){log2(mapply("/",x,sig_data[,5]))})

#split data into training and test cohorts
sig_data_train<-sig_data[survival_ids%in%c(britroc_train,pcawg_train,tcga_train),]
sig_data_test<-sig_data[survival_ids%in%c(britroc_test,pcawg_test,tcga_test),]
combined_os_survival_train<-combined_os_survival[survival_ids%in%c(britroc_train,pcawg_train,tcga_train)]
combined_os_survival_test<-combined_os_survival[survival_ids%in%c(britroc_test,pcawg_test,tcga_test)]

Fitting a cox-proportional hazard model (training)

Here we fit a cox proportional hazards model to predict survival with the normlised signature exposures as covariates, stratified by study and age.

os_coxph<-coxph(combined_os_survival_train ~ s1+s2+s3+s4+s6+s7 + strata(study, age.cat,na.group=T), data = sig_data_train)
summary(os_coxph)
## Call:
## coxph(formula = combined_os_survival_train ~ s1 + s2 + s3 + s4 + 
##     s6 + s7 + strata(study, age.cat, na.group = T), data = sig_data_train)
## 
##   n= 417, number of events= 241 
## 
##         coef exp(coef)  se(coef)      z Pr(>|z|)    
## s1  0.167312  1.182123  0.050049  3.343 0.000829 ***
## s2  0.172932  1.188786  0.064564  2.678 0.007396 ** 
## s3 -0.139257  0.870004  0.059735 -2.331 0.019740 *  
## s4 -0.017087  0.983058  0.056725 -0.301 0.763236    
## s6  0.001405  1.001406  0.051779  0.027 0.978353    
## s7 -0.174542  0.839842  0.053591 -3.257 0.001126 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##    exp(coef) exp(-coef) lower .95 upper .95
## s1    1.1821     0.8459    1.0717    1.3040
## s2    1.1888     0.8412    1.0475    1.3492
## s3    0.8700     1.1494    0.7739    0.9781
## s4    0.9831     1.0172    0.8796    1.0987
## s6    1.0014     0.9986    0.9048    1.1084
## s7    0.8398     1.1907    0.7561    0.9329
## 
## Concordance= 0.567  (se = 0.092 )
## Rsquare= 0.048   (max possible= 0.881 )
## Likelihood ratio test= 20.58  on 6 df,   p=0.002179
## Wald test            = 20.53  on 6 df,   p=0.002229
## Score (logrank) test = 20.95  on 6 df,   p=0.001869
#test for proportional hazards
cox.zph(os_coxph)
##            rho chisq      p
## s1      0.0380 0.282 0.5952
## s2      0.1482 5.765 0.0163
## s3     -0.1094 2.659 0.1030
## s4     -0.0235 0.129 0.7194
## s6     -0.0356 0.292 0.5890
## s7     -0.0453 0.432 0.5112
## GLOBAL      NA 6.980 0.3227

Note: exp(coef) is the Hazard ratio.

Testing the cox-proportional hazard model (test)

Here we tested the ability of the cox proportional hazards model to predict overall survival using the concordance index.

#predict survival using coxph model
train_predict<-predict(os_coxph,type="risk")
test_predict<-predict(os_coxph,newdata=sig_data_test,type="risk")

perf <- survcomp::concordance.index(x = test_predict, surv.time = combined_os_survival_test[,"stop"],
                          surv.event = combined_os_survival_test[, "status"], method = "noether",na.rm = TRUE)
perf[1:5]
## $c.index
## [1] 0.5597025
## 
## $se
## [1] 0.03012423
## 
## $lower
## [1] 0.5006601
## 
## $upper
## [1] 0.6187449
## 
## $p.value
## [1] 0.04749321

Fitting cox-proportional hazard model (all data)

We then fit a second cox proportional hazards model to the combined test and training data.

os_coxph<-coxph(combined_os_survival~ s1+s2+s3+s4+s6+s7 + strata(study, age.cat,na.group=T), data = sig_data)
summary(os_coxph)
## Call:
## coxph(formula = combined_os_survival ~ s1 + s2 + s3 + s4 + s6 + 
##     s7 + strata(study, age.cat, na.group = T), data = sig_data)
## 
##   n= 575, number of events= 335 
## 
##         coef exp(coef)  se(coef)      z Pr(>|z|)    
## s1  0.138814  1.148910  0.041584  3.338 0.000843 ***
## s2  0.109738  1.115986  0.050336  2.180 0.029250 *  
## s3 -0.094791  0.909563  0.048213 -1.966 0.049289 *  
## s4 -0.001757  0.998244  0.047431 -0.037 0.970445    
## s6  0.014342  1.014446  0.042268  0.339 0.734369    
## s7 -0.122635  0.884586  0.044438 -2.760 0.005786 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##    exp(coef) exp(-coef) lower .95 upper .95
## s1    1.1489     0.8704    1.0590    1.2465
## s2    1.1160     0.8961    1.0111    1.2317
## s3    0.9096     1.0994    0.8275    0.9997
## s4    0.9982     1.0018    0.9096    1.0955
## s6    1.0144     0.9858    0.9338    1.1021
## s7    0.8846     1.1305    0.8108    0.9651
## 
## Concordance= 0.544  (se = 0.072 )
## Rsquare= 0.036   (max possible= 0.928 )
## Likelihood ratio test= 20.87  on 6 df,   p=0.001933
## Wald test            = 20.72  on 6 df,   p=0.002062
## Score (logrank) test = 20.99  on 6 df,   p=0.001841
#check for non-proportional hazards
cox.zph(os_coxph)
##             rho  chisq      p
## s1      0.00991 0.0302 0.8620
## s2      0.10822 4.4903 0.0341
## s3     -0.06856 1.4517 0.2283
## s4      0.02384 0.1926 0.6607
## s6     -0.02723 0.2470 0.6192
## s7     -0.04619 0.6618 0.4159
## GLOBAL       NA 6.7374 0.3458
#Compute hazard ratio table
predicted_survival<-predict(os_coxph,type="risk")

HR <- round(exp(coef(os_coxph)), 2)
CI <- round(exp(confint(os_coxph)), 2)
P <- round(coef(summary(os_coxph))[,5], 3)
colnames(CI) <- c("Lower", "Higher")
hazard_table <- as.data.frame(cbind(HR, CI, P))
hazard_table<-cbind(Signature=row.names(hazard_table),hazard_table)
knitr::kable(hazard_table)
Signature HR Lower Higher P
s1 s1 1.15 1.06 1.25 0.001
s2 s2 1.12 1.01 1.23 0.029
s3 s3 0.91 0.83 1.00 0.049
s4 s4 1.00 0.91 1.10 0.970
s6 s6 1.01 0.93 1.10 0.734
s7 s7 0.88 0.81 0.97 0.006

Unsupervised clustering of patients using signature exposures

To group patients based on their copy-number signature exposures we used the NBClust package to determine the number of clusters present in the data and cluster the patients.

mydata<-t(combined_sig_pat_mat[,])
#determine number of clusters
fit <- NbClust::NbClust(mydata,method="ward.D2")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 1 proposed 2 as the best number of clusters 
## * 12 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 2 proposed 11 as the best number of clusters 
## * 2 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
#visualise signature exposures within clusters
sig_data$clust=fit$Best.partition

pdat<-data.frame(t(combined_sig_pat_mat))
pdat$clust<-fit$Best.partition
pdat$sample<-rownames(pdat)
pdat<-reshape2::melt(pdat,id.vars=c(8,9))
pdat$variable<-plyr::revalue(pdat$variable,
                            c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
ggplot(pdat,aes(x=variable,y=value))+geom_boxplot()+geom_jitter(alpha=0.1)+facet_grid(clust ~ .)+my_theme+xlab("Signature")+ylab("Exposure")

Clustering membership plot

pdat$sample<-factor(pdat$sample,levels=names(predicted_survival)[order(predicted_survival,decreasing=T)])
pdat$clust<-factor(pdat$clust,levels=3:1)
clust_plot<-ggplot(pdat,aes(y=clust,x=sample))+geom_tile()+my_theme+theme(axis.text.x =element_blank(),axis.ticks.x = element_blank())+
  xlab("")+ylab("")+scale_fill_discrete()+theme(legend.position = "none")
clust_plot

Fitting cox-proportional hazard model for clusters

We also fit a cox-proportional hazards model using the clusters as covariates.

sig_data$clust<-factor(sig_data$clust,levels=3:1)
os_coxph<-coxph(combined_os_survival~ clust + strata(study, age.cat), data = sig_data)
summary(os_coxph)
## Call:
## coxph(formula = combined_os_survival ~ clust + strata(study, 
##     age.cat), data = sig_data)
## 
##   n= 575, number of events= 335 
## 
##           coef exp(coef) se(coef)     z Pr(>|z|)   
## clust2 0.53158   1.70161  0.17598 3.021  0.00252 **
## clust1 0.05604   1.05764  0.14193 0.395  0.69296   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## clust2     1.702     0.5877    1.2052     2.402
## clust1     1.058     0.9455    0.8008     1.397
## 
## Concordance= 0.518  (se = 0.063 )
## Rsquare= 0.017   (max possible= 0.928 )
## Likelihood ratio test= 10.04  on 2 df,   p=0.006602
## Wald test            = 10.71  on 2 df,   p=0.00473
## Score (logrank) test = 10.89  on 2 df,   p=0.004324
cox.zph(os_coxph)
##           rho chisq     p
## clust2 0.0614 1.209 0.271
## clust1 0.0414 0.548 0.459
## GLOBAL     NA 1.261 0.532

While this model showed significant predictive preformance, the exposures across the clusters (above) reveal that cluster 2 is based primarily on signal from signature 1 exposures, and thus a cluster based model does not add information above and beyond the signature model.

Visualising cox-proportional hazard model

Association analysis

Mutated pathways and driver genes

We combined all three cohorts to test if mutant carriers had significantly higher signature exposures compared to wild-type carriers for known cancer causing genes.

Prepararation of driver mutation list for input into Cancer Genome Interpreter

library(VariantAnnotation)
#this chunk of code is not run be default due to data restrictions. Please download the required files (see below) to execute this segment of code.
all_mutations<-c()

#TCGA cohort
#download Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0 from Broad GDAC https://gdac.broadinstitute.org/
tcga_ids<-colnames(sig_pat_mat_tcga)
tcga_ids_missing<-c()
for(i in tcga_ids)
{
    fn<-paste0("restricted_data/gdac.broadinstitute.org_OV.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0/",
            substr(i,1,15),".maf.txt")
  if(file.exists(fn))
  {
  curr_mut<-read.table(fn,sep="\t",header=T,stringsAsFactors = F)
  all_mutations<-rbind(all_mutations,cbind(i,curr_mut$Chromosome,curr_mut$Start_position,as.character(curr_mut$Reference_Allele),as.character(curr_mut$Tumor_Seq_Allele2)))
  }else{
    tcga_ids_missing<-c(tcga_ids_missing,i)
  }
}
colnames(all_mutations)<-c("sample","chromosome","position","ref","alt")
tcga_mutations<-all_mutations

#PCAWG cohort
#download the final_consensus_12oct_passonly_snv_indel.tar variant calls from PCAWG then run the following commands:
#grep -v -E 'IGR|Silent|RNA|lincRNA|Intron|UTR|Flank|Splice' October_2016_ovary.snv_indel.maf > October_2016_ovary_deleterious.snv_indel.maf
#grep -v IGR October_2016_ovary.snv_indel.maf | cut -f13,2,3,8,10 - |  awk '{print $5 "\t" $1 "\t" $2 "\t" $3 "\t" $4 }' - > October_2016_ovary.snv_indel.CGIformat
pcawg_ids<-colnames(sig_pat_mat_pcawg)
pcawg_mut_table<-read.table("restricted_data/October_2016_ovary.snv_indel.CGIformat",header=T,sep=",",stringsAsFactors=F)
colnames(pcawg_mut_table)<-c("sample","chromosome","position","ref","alt")
pcawg_ids_missing<-pcawg_ids[!pcawg_ids%in%pcawg_mut_table$sample]
pcawg_mut_table<-pcawg_mut_table[pcawg_mut_table$sample%in%pcawg_ids,]
pcawg_mut_table<-as.matrix(pcawg_mut_table)
all_mutations<-rbind(all_mutations,pcawg_mut_table)
pcawg_mutations<-pcawg_mut_table

#BritROC cohort
#download VCFs from xxxxxxx and put in restricted_data/SNVs directory
isDeleterious<-function(x){
  grepl(paste("3_prime_UTR_variant",
"5_prime_UTR_premature_start_codon_gain_variant",
"5_prime_UTR_variant",
"downstream_gene_variant",
"initiator_codon_variant",
#"intergenic_region",
"intragenic_variant",
"intron_variant",
"missense_variant",
"missense_variant&splice_region_variant",
"non_coding_transcript_exon_variant",
"splice_acceptor_variant&intron_variant",
"splice_donor_variant&intron_variant",
"splice_region_variant",
"splice_region_variant&intron_variant",
"splice_region_variant&non_coding_transcript_exon_variant",
"splice_region_variant&synonymous_variant",
"start_lost",
"start_lost&splice_region_variant",
"stop_gained",
"stop_gained&splice_region_variant",
"stop_lost",
"stop_lost&splice_region_variant",
"stop_retained_variant",
"synonymous_variant",
"upstream_gene_variant",
sep="|"),x)}
isPASS<-function(x){
  grepl("PASS",x,fixed=TRUE)}
prefilters <- FilterRules(list(PASS=isPASS,deleterious=isDeleterious))

britroc_mutations<-c()
library(VariantAnnotation)
for(i in 1:nrow(britroc_ids))
{
  study_id<-as.character(britroc_ids[i,1])
  jblab_id<-as.character(britroc_ids[i,2])
  filename<-paste0("restricted_data/SNVs/",study_id,".snv.filters.blacklist.vcf.gz")
  if(file.exists(filename))
  {
  tabix.file <- TabixFile(filename, yieldSize=10000)
  destination.file<-paste0("restricted_data/SNVs/",study_id,".snv.deleterious.vcf")
  if(!file.exists(destination.file))
  {
    filterVcf(tabix.file,"hg19", destination.file, prefilters=prefilters, verbose=TRUE)
  }
  curr_vcf<-readVcf(destination.file)
  
  britroc_mutations<-rbind(britroc_mutations,cbind(jblab_id,as.character(seqnames(curr_vcf)),
          as.character(start(curr_vcf)),
          as.character(ref(curr_vcf)),
          as.character(unlist(alt(curr_vcf)))))
  }
}
all_mutations<-rbind(all_mutations,britroc_mutations)

all_mutations[,2]<-paste0("chr",all_mutations[,2])
colnames(all_mutations)<-c("sample","chr","pos","ref","alt")
all_mutations[,3]<-as.numeric(all_mutations[,3])

#calculate loss and amplification for CGI
CGI_cancer_genes<-read.table("data/cancer_genes_upon_mutations_or_CNAs.tsv",sep="\t",stringsAsFactors = F,header=T)
CGI_cancer_genes<-unique(CGI_cancer_genes$gene)

#get gene TSS
library(org.Hs.eg.db)
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
loc<-TxDb.Hsapiens.UCSC.hg19.knownGene
CGI_genemap<-select(org.Hs.eg.db, CGI_cancer_genes, c("ENTREZID","GENENAME"), "ALIAS")
gene_locs<-select(loc,CGI_genemap$ENTREZID,c("GENEID","TXCHROM","TXSTART","TXEND"),"GENEID")
gene_locs<-gene_locs[!duplicated(gene_locs$GENEID),]

gene_coords<-c()
for(i in CGI_cancer_genes)
{
  currg<-CGI_genemap[CGI_genemap$ALIAS==i,"ENTREZID"]
  if(length(currg)==1)
  {
    curr_loc<-gene_locs[gene_locs$GENEID==currg,]
  }
  else
  {
    for(j in currg)
    {
      curr_loc<-gene_locs[gene_locs$GENEID==j,]
      if(!is.na(curr_loc[1,2]))
      {
        break
      }
    }
  }
  if(!is.null(curr_loc))
  {
    chr<-curr_loc[1,"TXCHROM"]
    tss<-abs(curr_loc[1,"TXSTART"])
    tss_end<-abs(curr_loc[1,"TXEND"])
    gene_coords<-rbind(gene_coords,as.character(c(i,chr,tss,tss_end)))
  }
}
colnames(gene_coords)<-c("gene","chr","tss","tss_end")
gene_coords<-data.frame(gene_coords,stringsAsFactors = F)
gene_coords$chr<-substring(gene_coords$chr,4)
gene_coords<-gene_coords[gene_coords$chr%in%c(1:22),]

britroc_segTab<-c()
britroc_segTab_remain<-c()

for(i in 1:ncol(hq_CN))
{
  britroc_segTab[[colnames(hq_CN[,i])]]<-getSegTable(hq_CN[,i])
}
for(i in 1:ncol(lq_CN))
{
  britroc_segTab[[colnames(lq_CN[,i])]]<-getSegTable(lq_CN[,i])
}
for(i in 1:ncol(remain_CN))
{
  britroc_segTab_remain[[colnames(remain_CN[,i])]]<-getSegTable(remain_CN[,i])
}
all_segTabs<-c(britroc_segTab,pcawg_segTabs,tcga_segTabs,britroc_segTab_remain)

cn_mat<-c()
for(j in 1:nrow(gene_coords))
{
  res<-c()
  name<-gene_coords[j,"gene"]
  chr<-gene_coords[j,"chr"]
  tss<-abs(as.numeric(gene_coords[j,"tss"]))
  for(i in colnames(all_sig))
  {
    segTable<-all_segTabs[[i]]
    cn<-as.numeric(segTable[segTable[,1]==chr&as.numeric(segTable[,2])<tss&as.numeric(segTable[,3])>tss,4])
    if(!length(cn)==1)
    {
      modtss<-tss+1000000
      cn<-as.numeric(segTable[segTable[,1]==chr&as.numeric(segTable[,2])<modtss&as.numeric(segTable[,3])>modtss,4])
      if(!length(cn)==1)
      {
      cn<-NA
      }
    }
    res<-c(res,cn)
  }
  cn_mat<-rbind(cn_mat,c(name,res))
}
colnames(cn_mat)<-c("Gene",colnames(all_sig))
rownames(cn_mat)<-cn_mat[,1]
cn_mat<-cn_mat[,-1]

del<-apply(cn_mat,1,function(x){
  x<-x[!is.na(x)]
  names(x[as.numeric(x)<0.4])})
del<-del[unlist(lapply(del,length))>0]

amp<-apply(cn_mat,1,function(x){
  x<-x[!is.na(x)]
  names(x[as.numeric(x)>5])})
amp<-amp[unlist(lapply(amp,length))>0]

cn_forCGI<-c()
for(i in names(del))
{
  cn_forCGI<-rbind(cn_forCGI,cbind(del[[i]],i,"del"))
}
for(i in names(amp))
{
  cn_forCGI<-rbind(cn_forCGI,cbind(amp[[i]],i,"amp"))
}
colnames(cn_forCGI)<-c("sample","gene","cna")
write.table(cn_forCGI,"cn_for_CGI.txt",quote=F,row.names=F,sep="\t")

gene_coords_gr<-gene_coords
colnames(gene_coords_gr)<-c("gene","chr","start","end")
gene_coords_gr$chr<-paste0("chr",gene_coords_gr$chr)
gene_coords_gr<-makeGRangesFromDataFrame(gene_coords_gr,seqinfo = paste0("chr",c(1:22,"X","Y")))

colnames(tcga_mutations)<-colnames(all_mutations)
library(rtracklayer)

tcga_mutations<-data.frame(tcga_mutations)
coords_to_liftover<-data.frame(chr=paste0("chr",tcga_mutations$chr),start=tcga_mutations$pos,end=tcga_mutations$pos,tcga_mutations[,c(1,4,5)])
gr<-makeGRangesFromDataFrame(coords_to_liftover,keep.extra.columns = TRUE,ignore.strand = TRUE)
chain<-import.chain("data/hg18ToHg19.over.chain")
gr_hg19<-liftOver(gr,chain)
gr_hg19<-subsetByOverlaps(unlist(gr_hg19),gene_coords_gr)
tcga_mutations_hg19<-data.frame(gr_hg19)[,c("sample","seqnames","start","ref","alt")]
colnames(tcga_mutations_hg19)<-c("sample","chr","pos","ref","alt")
tcga_mutations_hg19$chr<-substring(tcga_mutations_hg19$chr,4)
write.table(tcga_mutations_hg19,"tcga_mutations_for_CGI.txt",quote=F,row.names=F,sep="\t")

britroc_mutations<-data.frame(britroc_mutations)
colnames(britroc_mutations)<-colnames(all_mutations)
britroc_mutations_gr<-data.frame(chr=paste0("chr",britroc_mutations$chr),start=britroc_mutations$pos,end=britroc_mutations$pos,britroc_mutations[,c(1,4,5)])
britroc_mutations_gr<-makeGRangesFromDataFrame(britroc_mutations_gr,keep.extra.columns = TRUE,ignore.strand = TRUE)
britroc_mutations_gr<-subsetByOverlaps(unlist(britroc_mutations_gr),gene_coords_gr)
britroc_mutations<-data.frame(britroc_mutations_gr)[,c("sample","seqnames","start","ref","alt")]
colnames(britroc_mutations)<-c("sample","chr","pos","ref","alt")
britroc_mutations$chr<-substring(britroc_mutations$chr,4)
write.table(britroc_mutations,"britroc_mutations_for_CGI.txt",quote=F,row.names=F,sep="\t")

pcawg_mutations<-data.frame(pcawg_mutations)
colnames(pcawg_mutations)<-colnames(all_mutations)
pcawg_mutations_gr<-data.frame(chr=paste0("chr",pcawg_mutations$chr),start=pcawg_mutations$pos,end=pcawg_mutations$pos,pcawg_mutations[,c(1,4,5)])
pcawg_mutations_gr<-makeGRangesFromDataFrame(pcawg_mutations_gr,keep.extra.columns = TRUE,ignore.strand = TRUE)
pcawg_mutations_gr<-subsetByOverlaps(unlist(pcawg_mutations_gr),gene_coords_gr)
pcawg_mutations<-data.frame(pcawg_mutations_gr)[,c("sample","seqnames","start","ref","alt")]
colnames(pcawg_mutations)<-c("sample","chr","pos","ref","alt")
pcawg_mutations$chr<-substring(pcawg_mutations$chr,4)
write.table(pcawg_mutations,"pcawg_mutations_for_CGI.txt",quote=F,row.names=F,sep="\t")

To run the Cancer Genome Interpreter analysis use the python script launch_CGI.py located in the scripts directory.

Extraction of pathway list from Reactome enriched for mutated drivers

#after the above analysis is complete, results should be downloaded into the following directories:
#PCAWG:restricted_data/pcawg/
#TCGA:restricted_data/tcga/
#BRITROC:restricted_data/britroc/
#CNAs:resticted_data/cnas
CGImuts_pcawg<-read.table("restricted_data/pcawg/mutation_analysis.tsv",header=T,sep="\t",stringsAsFactors = F)
CGImuts_tcga<-read.table("restricted_data/tcga/mutation_analysis.tsv",header=T,sep="\t",stringsAsFactors = F)
CGImuts_britroc<-read.table("restricted_data/britroc/mutation_analysis.tsv",header=T,sep="\t",stringsAsFactors = F)
CGImuts<-rbind(CGImuts_britroc,CGImuts_pcawg,CGImuts_tcga)
CGImuts<-CGImuts[CGImuts$driver_mut_prediction%in%c("TIER 1","TIER 2"),]
write.table(CGImuts,"mutated_cases.tsv",row.names = F,sep="\t")

CGIcna<-read.table("restricted_data/cnas/cna_analysis.tsv",header=T,sep="\t",stringsAsFactors = F)
CGIcna<-CGIcna[!CGIcna$driver_statement%in%c("predicted passenger"),]
write.table(CGIcna,"del_amp_cases.tsv",row.names = F,sep="\t")

CGIcna<-CGIcna[CGIcna$sample%in%all_mutations[,1],]

all_mut<-list()
for(i in unique(CGImuts$sample))
{
  all_mut[[i]]<-CGImuts[CGImuts$sample==i,"gene"]
}
for(i in unique(CGIcna$sample))
{
  if(is.null(all_mut[[i]]))
  {
    all_mut[[i]]<-CGIcna[CGIcna$sample==i,"gene"]
  }
  else
  {
    all_mut[[i]]<-c(all_mut[[i]],CGIcna[CGIcna$sample==i,"gene"])
  }
}
all_mut<-all_mut[names(all_mut)%in%colnames(all_sig)]

#contruct mutation matrix
all_genes<-unique(as.character(unlist(all_mut)))
all_genes<-all_genes[!is.na(all_genes)]

library(ReactomePA)
library(org.Hs.eg.db)
genemap<-AnnotationDbi::select(org.Hs.eg.db, all_genes, c("ENTREZID","GENENAME"), "ALIAS")
genemap<-genemap[!duplicated(genemap$ALIAS),]
x <- enrichPathway(gene=genemap$ENTREZID,pvalueCutoff=0.05,qvalueCutoff=0.05,readable = T)
enriched_pathways<-x
x<-as.data.frame(x)
x<-x[x$Count>=5,]

pathway_genes<-list()
for(i in 1:nrow(x))
{
  gene_ratio<-as.numeric(unlist(strsplit(x[i,]$GeneRatio,"/")))
  bg_ratio<-as.numeric(unlist(strsplit(x[i,]$BgRatio,"/")))
  if((gene_ratio[1]/bg_ratio[1])>0.05)
  {
    pathway_genes[[x[i,]$ID]]<-unlist(strsplit(x[i,]$geneID,"/"))
  }
}
all_genes<-all_genes[all_genes%in%unique(unlist(pathway_genes))]
all_genes<-all_genes[!all_genes=="TP53"]

mut_matrix<-c()
for(g in all_genes)
{
  mut_matrix<-cbind(mut_matrix,unlist(lapply(all_mut,function(x){sum(x%in%g)>0})))
}
colnames(mut_matrix)<-all_genes
mut_matrix<-mut_matrix[,colSums(mut_matrix)>=1]
mut_matrix<-data.frame(mut_matrix)
all_sequenced_patients<-unique(all_mutations[,1])
samples_to_add<-all_sequenced_patients[!all_sequenced_patients%in%rownames(mut_matrix)]
unmut<-rep(FALSE,length(samples_to_add)*ncol(mut_matrix))
dim(unmut)<-c(length(samples_to_add),ncol(mut_matrix))
rownames(unmut)<-samples_to_add
colnames(unmut)<-colnames(mut_matrix)
mut_matrix<-rbind(mut_matrix,unmut)
mut_matrix<-mut_matrix[rownames(mut_matrix)%in%colnames(all_sig),]

#request the following files from the DACO for this project: HRstatus.csv and somaticHRD.csv
HRmuts<-read.table("restricted_data/HRstatus.csv",header=T,stringsAsFactors = F,sep=",")
HRmuts$Gene[is.na(HRmuts$Gene)]<-"WT"
BRCA1_mutants<-britroc_ids[britroc_ids[,1]%in%HRmuts$BritRoc.No[HRmuts$Gene=="BRCA1"],2]
BRCA2_mutants<-britroc_ids[britroc_ids[,1]%in%HRmuts$BritRoc.No[HRmuts$Gene=="BRCA2"],2]
somaticBRCA<-read.table("restricted_data/somaticHRD.csv",header=T,stringsAsFactors = F,sep=",")
BRCA1_mutants<-c(BRCA1_mutants,britroc_ids[britroc_ids[,2]%in%somaticBRCA[somaticBRCA$Symbol..Gene.ID.=="BRCA1","Sample"],2])
BRCA2_mutants<-c(BRCA2_mutants,britroc_ids[britroc_ids[,2]%in%somaticBRCA[somaticBRCA$Symbol..Gene.ID.=="BRCA2","Sample"],2])
mut_matrix[rownames(mut_matrix)%in%BRCA1_mutants,"BRCA1"]<-TRUE
mut_matrix[rownames(mut_matrix)%in%BRCA2_mutants,"BRCA2"]<-TRUE

#get germline BRCA mutations for TCGA
BRCA1<-c("TCGA-10-0931",
         "TCGA-13-1408","TCGA-23-102","TCGA-23-1118","TCGA-23-2078","TCGA-23-2079","TCGA-13-0887",
         "TCGA-13-1494","TCGA-13-0893","TCGA-13-0903","TCGA-61-2109","TCGA-04-1356","TCGA-59-2348",
         "TCGA-13-1512","TCGA-09-1669","TCGA-25-2392","TCGA-24-2298","TCGA-24-1470","TCGA-57-1582",
         "TCGA-09-2051","TCGA-13-0883","TCGA-23-1122","TCGA-23-2077","TCGA-23-2081","TCGA-25-2401",
         "TCGA-09-2045","TCGA-61-2008")

BRCA2<-c("TCGA-24-0975","TCGA-24-2288","TCGA-13-0900","TCGA-04-1367","TCGA-25-2404","TCGA-24-1463",
         "TCGA-24-1417","TCGA-24-2024","TCGA-04-1336","TCGA-13-0913","TCGA-13-0886","TCGA-13-1498",
         "TCGA-13-1499","TCGA-24-2280","TCGA-59-2351","TCGA-13-0726","TCGA-24-2293","TCGA-24-1562",
         "TCGA-13-1512","TCGA-23-1026")
mut_matrix[substring(rownames(mut_matrix),1,12)%in%BRCA1,"BRCA1"]<-TRUE
mut_matrix[substring(rownames(mut_matrix),1,12)%in%BRCA2,"BRCA2"]<-TRUE

saveRDS(mut_matrix,"data/mutation_matrix.rds")
saveRDS(pathway_genes,"data/pathway_genes.rds")
saveRDS(enriched_pathways,"data/enriched_pathways.rds")
saveRDS(CGImuts[,c("sample","gene","consequence","gene_role")],"data/CGI_mut_role.rds")
saveRDS(CGIcna[,c("sample","gene","cna","gene_role")],"data/CGI_cna_role.rds")

Changes in exposures between cases with mutated pathways versus non-mutated

mut_matrix<-readRDS("data/mutation_matrix.rds")
pathway_genes<-readRDS("data/pathway_genes.rds")
enriched_pathways<-readRDS("data/enriched_pathways.rds")

min_cases_mutated<-24
min_cases_top_two_genes<-5
pathway_mut_matrix<-c()
remove_pathway<-c()
matrix_pathways<-c()
pathway_mutated_genecount<-list()
all_exposures_mut_wt<-c()
for(p in names(pathway_genes))
{
  curr_genes<-pathway_genes[[p]]
  curr_mut_matrix<-as.matrix(mut_matrix[,colnames(mut_matrix)%in%curr_genes])
  if(sum(rowSums(curr_mut_matrix)>0)>=min_cases_mutated&sum(colSums(curr_mut_matrix)>=min_cases_top_two_genes)>=0)
  {
  if(ncol(curr_mut_matrix)>1)
  {
    pathway_mut_matrix<-cbind(pathway_mut_matrix,rowSums(curr_mut_matrix)>0)
    matrix_pathways<-c(matrix_pathways,p)
    pathway_mutated_genecount[[p]]<-colSums(curr_mut_matrix)
  }else if(sum(colnames(mut_matrix)%in%curr_genes)==0){
    remove_pathway<-c(remove_pathway,p)
  }else{
    pathway_mut_matrix<-cbind(pathway_mut_matrix,curr_mut_matrix[,1])
    matrix_pathways<-c(matrix_pathways,p)
    pathway_mutated_genecount[[p]]<-sum(curr_mut_matrix)
  }
  }
}
colnames(pathway_mut_matrix)<-matrix_pathways
mut_ratio_info<-c()
for(g in colnames(pathway_mut_matrix))
{
  mutated<-pathway_mut_matrix[,g]
  for(s in rownames(all_sig))
  {
    mut<-as.numeric(all_sig[s,rownames(pathway_mut_matrix)[mutated]])
    nonmut<-as.numeric(all_sig[s,rownames(pathway_mut_matrix)[!mutated]])
    all_exposures_mut_wt<-rbind(all_exposures_mut_wt,rbind(cbind(s,g,"mut",mut),cbind(s,g,"wt",nonmut)))
    if(median(mut)>median(nonmut))
    {
    mut_ratio_info<-rbind(mut_ratio_info,c(g,s,median(mut)-median(nonmut),wilcox.test(mut,nonmut,alternative="greater")$p.value))
    }
  }
}
colnames(mut_ratio_info)<-c("Type","Signature","mean_diff","pvalue")
mut_ratio_info<-data.frame(mut_ratio_info,stringsAsFactors=F)

mut_ratio_info<-mut_ratio_info[abs(as.numeric(mut_ratio_info$mean_diff))>=0.10,]
mut_ratio_info$qvalue<-p.adjust(mut_ratio_info[,4],method="BH")
mut_ratio_info$Description<-enriched_pathways[mut_ratio_info$Type,"Description"]
mut_ratio_info$Signature<-factor(mut_ratio_info$Signature,levels=paste0("s",1:nsig))
mut_ratio_info$Signature<-plyr::revalue(mut_ratio_info$Signature,
                         c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
mut_ratio_info<-mut_ratio_info[mut_ratio_info$qvalue<0.005,]
mut_ratio_info<-merge(mut_ratio_info,enriched_pathways[,c("ID","geneID")],by.x=1,by.y=1)
mut_ratio_info<-mut_ratio_info[order(mut_ratio_info$Signature,mut_ratio_info$qvalue),]

colnames(all_exposures_mut_wt)<-c("Signature","ID","Status","Exposure")
all_exposures_mut_wt<-data.frame(all_exposures_mut_wt,stringsAsFactors = F)
all_exposures_mut_wt$Description<-enriched_pathways[all_exposures_mut_wt$ID,"Description"]
knitr::kable(mut_ratio_info[,1:6],"html") %>%
    kableExtra::kable_styling(full_width=F) %>%
  kableExtra::scroll_box(width = "100%", height = "300px")
Type Signature mean_diff pvalue qvalue Description
172 R-HSA-8849471 1 0.252588740179993 8.46584283845526e-07 0.0000010 PTK6 Regulates RHO GTPases, RAS GTPase and MAP kinases
141 R-HSA-5658442 1 0.246731682237831 4.58409095238899e-06 0.0000051 Regulation of RAS by GAPs
144 R-HSA-5685942 3 0.198180242225367 0.00159530945990504 0.0017180 HDR through Homologous Recombination (HRR)
148 R-HSA-5693579 3 0.198180242225367 0.00159530945990504 0.0017180 Homologous DNA Pairing and Strand Exchange
145 R-HSA-5693537 3 0.196550061541629 0.00259400741114232 0.0027482 Resolution of D-Loop Structures
146 R-HSA-5693554 3 0.196550061541629 0.00259400741114232 0.0027482 Resolution of D-loop Structures through Synthesis-Dependent Strand Annealing (SDSA)
147 R-HSA-5693568 3 0.196550061541629 0.00259400741114232 0.0027482 Resolution of D-loop Structures through Holliday Junction Intermediates
149 R-HSA-5693616 3 0.187943638170493 0.00319193917662721 0.0033635 Presynaptic phase of homologous DNA pairing and strand exchange
150 R-HSA-6785807 4 0.109545051623606 4.00495361286718e-25 0.0000000 Interleukin-4 and 13 signaling
62 R-HSA-2219528 4 0.109095994308619 3.04174513143323e-23 0.0000000 PI3K/AKT Signaling in Cancer
84 R-HSA-3769402 4 0.115938226332199 1.3729603212145e-21 0.0000000 Deactivation of the beta-catenin transactivating complex
76 R-HSA-2730905 4 0.104936071168879 4.19804789394049e-21 0.0000000 Role of LAT2/NTAL/LAB on calcium mobilization
10 R-HSA-1257604 4 0.104317921864257 2.82432941959549e-20 0.0000000 PIP3 activates AKT signaling
41 R-HSA-180292 4 0.104317921864257 2.82432941959549e-20 0.0000000 GAB1 signalosome
87 R-HSA-388841 4 0.103285240682332 1.19224217769441e-19 0.0000000 Costimulation by the CD28 family
56 R-HSA-198203 4 0.103395592455067 1.24618903058137e-19 0.0000000 PI3K/AKT activation
104 R-HSA-452723 4 0.118327541097481 2.83909515508328e-18 0.0000000 Transcriptional regulation of pluripotent stem cells
107 R-HSA-5218920 4 0.102817780424228 1.62863549453764e-17 0.0000000 VEGFR2 mediated vascular permeability
58 R-HSA-199418 4 0.100050735691377 3.19664360147671e-17 0.0000000 Negative regulation of the PI3K/AKT network
3 R-HSA-109704 4 0.100389973795218 1.22450079485844e-16 0.0000000 PI3K Cascade
158 R-HSA-6811558 4 0.108667130091737 2.4115328708105e-16 0.0000000 PI5P, PP2A and IER3 Regulate PI3K/AKT Signaling
64 R-HSA-2219530 4 0.104814160061847 1.17910747115717e-15 0.0000000 Constitutive Signaling by Aberrant PI3K in Cancer
176 R-HSA-8864260 4 0.100850241511579 1.5469307229411e-15 0.0000000 Transcriptional regulation by the AP-2 (TFAP2) family of transcription factors
80 R-HSA-373760 4 0.115456499609437 1.99543690448472e-15 0.0000000 L1CAM interactions
67 R-HSA-2559580 4 0.109615819583484 4.24875054152071e-11 0.0000000 Oxidative Stress Induced Senescence
129 R-HSA-5654727 4 0.109138003141597 6.42384871144081e-11 0.0000000 Negative regulation of FGFR2 signaling
132 R-HSA-5654733 4 0.100210636440278 6.65771113813847e-09 0.0000000 Negative regulation of FGFR4 signaling
69 R-HSA-2559582 4 0.1209934957094 1.60910024609967e-08 0.0000000 Senescence-Associated Secretory Phenotype (SASP)
72 R-HSA-2559585 4 0.10549338574056 4.65118251830628e-08 0.0000001 Oncogene Induced Senescence
102 R-HSA-450294 4 0.14910156336646 1.06327477150069e-07 0.0000001 MAP kinase activation in TLR cascade
18 R-HSA-166166 4 0.147241750288408 1.69755501756731e-07 0.0000002 MyD88-independent TLR3/TLR4 cascade
25 R-HSA-168164 4 0.147241750288408 1.69755501756731e-07 0.0000002 Toll Like Receptor 3 (TLR3) Cascade
177 R-HSA-937061 4 0.147241750288408 1.69755501756731e-07 0.0000002 TRIF-mediated TLR3/TLR4 signaling
14 R-HSA-166054 4 0.136149750110542 1.84027943483452e-07 0.0000002 Activated TLR4 signalling
21 R-HSA-168138 4 0.136149750110542 1.84027943483452e-07 0.0000002 Toll Like Receptor 9 (TLR9) Cascade
31 R-HSA-168181 4 0.136149750110542 1.84027943483452e-07 0.0000002 Toll Like Receptor 7/8 (TLR7/8) Cascade
181 R-HSA-975155 4 0.136149750110542 1.84027943483452e-07 0.0000002 MyD88 dependent cascade initiated on endosome
16 R-HSA-166058 4 0.141579622011902 2.08023669764227e-07 0.0000002 MyD88:Mal cascade initiated on plasma membrane
23 R-HSA-168142 4 0.141579622011902 2.08023669764227e-07 0.0000002 Toll Like Receptor 10 (TLR10) Cascade
27 R-HSA-168176 4 0.141579622011902 2.08023669764227e-07 0.0000002 Toll Like Receptor 5 (TLR5) Cascade
29 R-HSA-168179 4 0.141579622011902 2.08023669764227e-07 0.0000002 Toll Like Receptor TLR1:TLR2 Cascade
33 R-HSA-168188 4 0.141579622011902 2.08023669764227e-07 0.0000002 Toll Like Receptor TLR6:TLR2 Cascade
44 R-HSA-181438 4 0.141579622011902 2.08023669764227e-07 0.0000002 Toll Like Receptor 2 (TLR2) Cascade
179 R-HSA-975138 4 0.141579622011902 2.08023669764227e-07 0.0000002 TRAF6 mediated induction of NFkB and MAP kinases upon TLR7/8 or 9 activation
183 R-HSA-975871 4 0.141579622011902 2.08023669764227e-07 0.0000002 MyD88 cascade initiated on plasma membrane
12 R-HSA-166016 4 0.135476447152699 3.84382580101352e-07 0.0000005 Toll Like Receptor 4 (TLR4) Cascade
35 R-HSA-168898 4 0.135476447152699 3.84382580101352e-07 0.0000005 Toll-Like Receptors Cascades
100 R-HSA-400685 4 0.103852168540807 1.51313003420957e-06 0.0000017 Sema4D in semaphorin signaling
97 R-HSA-3928665 4 0.115442790036247 4.68758733659412e-06 0.0000052 EPH-ephrin mediated repulsion of cells
54 R-HSA-1963642 4 0.134373005439062 5.71484609954307e-06 0.0000063 PI3K events in ERBB2 signaling
166 R-HSA-69563 6 0.163574231031668 6.16914049604083e-36 0.0000000 p53-Dependent G1 DNA Damage Response
167 R-HSA-69580 6 0.163574231031668 6.16914049604083e-36 0.0000000 p53-Dependent G1/S DNA damage checkpoint
168 R-HSA-69615 6 0.163574231031668 6.16914049604083e-36 0.0000000 G1/S DNA Damage Checkpoints
71 R-HSA-2559583 6 0.118512570615044 1.79665216695224e-32 0.0000000 Cellular Senescence
74 R-HSA-2559586 6 0.166339581526184 3.69015347631317e-32 0.0000000 DNA Damage/Telomere Stress Induced Senescence
152 R-HSA-6791312 6 0.160307590148722 5.91855928379218e-32 0.0000000 TP53 Regulates Transcription of Cell Cycle Genes
63 R-HSA-2219528 6 0.150273201542753 1.96024379213766e-31 0.0000000 PI3K/AKT Signaling in Cancer
161 R-HSA-69206 6 0.104638891204408 1.93374648134732e-31 0.0000000 G1/S Transition
160 R-HSA-69202 6 0.1044133774531 5.09471406738826e-31 0.0000000 Cyclin E associated events during G1/S transition
108 R-HSA-5218920 6 0.188383693020523 3.65783580166394e-29 0.0000000 VEGFR2 mediated vascular permeability
66 R-HSA-2262752 6 0.109931542668385 5.155560602407e-29 0.0000000 Cellular responses to stress
170 R-HSA-69656 6 0.104638891204408 7.85386487154019e-28 0.0000000 Cyclin A:Cdk2-associated events at S phase entry
164 R-HSA-69242 6 0.104638891204408 1.10995848521904e-27 0.0000000 S Phase
171 R-HSA-8848021 6 0.131396735988981 1.74865525581942e-27 0.0000000 Signaling by PTK6
154 R-HSA-6804757 6 0.152256513681293 2.03704657428087e-27 0.0000000 Regulation of TP53 Degradation
157 R-HSA-6806003 6 0.152256513681293 2.03704657428087e-27 0.0000000 Regulation of TP53 Expression and Degradation
11 R-HSA-1257604 6 0.131329721531884 5.78919514625642e-27 0.0000000 PIP3 activates AKT signaling
42 R-HSA-180292 6 0.131329721531884 5.78919514625642e-27 0.0000000 GAB1 signalosome
77 R-HSA-2730905 6 0.126422234964336 1.22131426717467e-26 0.0000000 Role of LAT2/NTAL/LAB on calcium mobilization
57 R-HSA-198203 6 0.125586371099449 1.52398893332487e-26 0.0000000 PI3K/AKT activation
4 R-HSA-109704 6 0.176736865803021 2.67330922565291e-26 0.0000000 PI3K Cascade
89 R-HSA-389356 6 0.175049474706952 1.22657370453753e-25 0.0000000 CD28 co-stimulation
88 R-HSA-388841 6 0.152256513681293 1.39034856023611e-24 0.0000000 Costimulation by the CD28 family
59 R-HSA-199418 6 0.131329721531884 4.4481937945955e-24 0.0000000 Negative regulation of the PI3K/AKT network
91 R-HSA-390466 6 0.161176060591935 2.38254362776208e-23 0.0000000 Chaperonin-mediated protein folding
93 R-HSA-391251 6 0.161176060591935 2.38254362776208e-23 0.0000000 Protein folding
92 R-HSA-390471 6 0.162420089177491 4.53463481111177e-23 0.0000000 Association of TriC/CCT with target proteins during biosynthesis
162 R-HSA-69231 6 0.133655680740526 1.07920078678361e-22 0.0000000 Cyclin D associated events in G1
163 R-HSA-69236 6 0.133655680740526 1.07920078678361e-22 0.0000000 G1 Phase
175 R-HSA-8863795 6 0.196478357886946 1.08088925175971e-22 0.0000000 Downregulation of ERBB2 signaling
6 R-HSA-1168372 6 0.104789193119891 3.65454879046344e-22 0.0000000 Downstream signaling events of B Cell Receptor (BCR)
159 R-HSA-6811558 6 0.164728372885845 7.07868823477374e-22 0.0000000 PI5P, PP2A and IER3 Regulate PI3K/AKT Signaling
128 R-HSA-5654726 6 0.2086584780258 7.3486168950305e-22 0.0000000 Negative regulation of FGFR1 signaling
169 R-HSA-69620 6 0.104789193119891 1.45387755853743e-21 0.0000000 Cell Cycle Checkpoints
186 R-HSA-983705 6 0.1044133774531 1.84065076725682e-21 0.0000000 Signaling by the B Cell Receptor (BCR)
65 R-HSA-2219530 6 0.175049474706952 1.9785732331926e-21 0.0000000 Constitutive Signaling by Aberrant PI3K in Cancer
90 R-HSA-389357 6 0.178271626601345 2.59500038252183e-21 0.0000000 CD28 dependent PI3K/Akt signaling
81 R-HSA-373760 6 0.196478357886946 1.3678444649737e-20 0.0000000 L1CAM interactions
85 R-HSA-3769402 6 0.110052628743431 5.41203631730405e-20 0.0000000 Deactivation of the beta-catenin transactivating complex
130 R-HSA-5654727 6 0.203673171891168 5.74292148324212e-19 0.0000000 Negative regulation of FGFR2 signaling
114 R-HSA-5654689 6 0.203382889124788 1.28241269803462e-18 0.0000000 PI-3K cascade:FGFR1
2 R-HSA-109606 6 0.194838125600575 4.69939941267943e-17 0.0000000 Intrinsic Pathway for Apoptosis
68 R-HSA-2559580 6 0.138069351667405 1.31157318262041e-16 0.0000000 Oxidative Stress Induced Senescence
86 R-HSA-381340 6 0.203403580004792 2.45547083713807e-16 0.0000000 Transcriptional regulation of white adipocyte differentiation
73 R-HSA-2559585 6 0.155125189539284 6.91045715748456e-16 0.0000000 Oncogene Induced Senescence
116 R-HSA-5654695 6 0.184196567360406 6.86477052917807e-16 0.0000000 PI-3K cascade:FGFR2
115 R-HSA-5654693 6 0.189914838091013 9.19648058433477e-16 0.0000000 FRS-mediated FGFR1 signaling
37 R-HSA-169893 6 0.103449044970857 1.74590323859923e-15 0.0000000 Prolonged ERK activation events
38 R-HSA-170968 6 0.103449044970857 1.74590323859923e-15 0.0000000 Frs2-mediated activation
47 R-HSA-187687 6 0.103449044970857 1.74590323859923e-15 0.0000000 Signalling to ERKs
131 R-HSA-5654732 6 0.209784534529589 2.0339572413955e-15 0.0000000 Negative regulation of FGFR3 signaling
133 R-HSA-5654733 6 0.205949436803472 2.17759628201591e-15 0.0000000 Negative regulation of FGFR4 signaling
5 R-HSA-112412 6 0.103086223428299 2.91192468576799e-15 0.0000000 SOS-mediated signalling
20 R-HSA-167044 6 0.103086223428299 2.91192468576799e-15 0.0000000 Signalling to RAS
39 R-HSA-170984 6 0.103086223428299 2.91192468576799e-15 0.0000000 ARMS-mediated activation
40 R-HSA-179812 6 0.103086223428299 2.91192468576799e-15 0.0000000 GRB2 events in EGFR signaling
43 R-HSA-180336 6 0.103086223428299 2.91192468576799e-15 0.0000000 SHC1 events in EGFR signaling
48 R-HSA-187706 6 0.103086223428299 2.91192468576799e-15 0.0000000 Signalling to p38 via RIT and RIN
82 R-HSA-375165 6 0.103086223428299 2.91192468576799e-15 0.0000000 NCAM signaling for neurite out-growth
142 R-HSA-5673001 6 0.103086223428299 2.91192468576799e-15 0.0000000 RAF/MAP kinase cascade
143 R-HSA-5684996 6 0.103086223428299 2.91192468576799e-15 0.0000000 MAPK1/MAPK3 signaling
49 R-HSA-190236 6 0.164505116679766 3.44407375719912e-15 0.0000000 Signaling by FGFR
7 R-HSA-1226099 6 0.15936742005519 1.84051795469108e-14 0.0000000 Signaling by FGFR in disease
8 R-HSA-1227986 6 0.114519753596792 4.48987378349295e-14 0.0000000 Signaling by ERBB2
46 R-HSA-1839124 6 0.195465032607908 4.73210303501712e-14 0.0000000 FGFR1 mutant receptor activation
165 R-HSA-69541 6 0.178069995058495 1.00844409703753e-13 0.0000000 Stabilization of p53
1 R-HSA-109581 6 0.152179607398936 1.04587017896073e-13 0.0000000 Apoptosis
109 R-HSA-5357801 6 0.152179607398936 1.04587017896073e-13 0.0000000 Programmed Cell Death
112 R-HSA-5654687 6 0.17619060686178 2.95059280580149e-13 0.0000000 Downstream signaling of activated FGFR1
119 R-HSA-5654700 6 0.181662107096329 3.10696235534376e-13 0.0000000 FRS-mediated FGFR2 signaling
134 R-HSA-5654736 6 0.167177065945052 3.65835612790317e-13 0.0000000 Signaling by FGFR1
156 R-HSA-6804760 6 0.17619060686178 6.42480315821795e-13 0.0000000 Regulation of TP53 Activity through Methylation
153 R-HSA-6804114 6 0.178814329233032 6.94727708538222e-13 0.0000000 TP53 Regulates Transcription of Genes Involved in G2 Cell Cycle Arrest
50 R-HSA-1912408 6 0.183173874985823 9.29576388871165e-13 0.0000000 Pre-NOTCH Transcription and Translation
155 R-HSA-6804758 6 0.171804221242074 1.03447448986687e-12 0.0000000 Regulation of TP53 Activity through Acetylation
113 R-HSA-5654688 6 0.196542214235157 1.09592538810522e-12 0.0000000 SHC-mediated cascade:FGFR1
127 R-HSA-5654720 6 0.184522477916402 1.71266396916768e-12 0.0000000 PI-3K cascade:FGFR4
139 R-HSA-5655302 6 0.164393371544755 1.85418664449361e-12 0.0000000 Signaling by FGFR1 in disease
123 R-HSA-5654710 6 0.193048637478027 1.90557279403111e-12 0.0000000 PI-3K cascade:FGFR3
70 R-HSA-2559582 6 0.136435102280291 1.9278171960276e-12 0.0000000 Senescence-Associated Secretory Phenotype (SASP)
51 R-HSA-1912422 6 0.181816483781491 3.75757630854625e-12 0.0000000 Pre-NOTCH Expression and Processing
94 R-HSA-392451 6 0.102391778690985 6.31942545786839e-12 0.0000000 G beta:gamma signalling through PI3Kgamma
99 R-HSA-397795 6 0.102391778690985 6.31942545786839e-12 0.0000000 G-protein beta:gamma signalling
79 R-HSA-373755 6 0.163316717089663 1.73187541764886e-11 0.0000000 Semaphorin interactions
117 R-HSA-5654696 6 0.149982443799756 5.9825590348103e-11 0.0000000 Downstream signaling of activated FGFR2
138 R-HSA-5655253 6 0.149982443799756 5.9825590348103e-11 0.0000000 Signaling by FGFR2 in disease
135 R-HSA-5654738 6 0.141114922885457 6.86563006005631e-11 0.0000000 Signaling by FGFR2
83 R-HSA-376176 6 0.183623594979209 2.14287569338875e-10 0.0000000 Signaling by Robo receptor
124 R-HSA-5654712 6 0.168759366581101 4.46942232901848e-10 0.0000000 FRS-mediated FGFR4 signaling
19 R-HSA-166166 6 0.13505549202586 5.00342335673692e-10 0.0000000 MyD88-independent TLR3/TLR4 cascade
26 R-HSA-168164 6 0.13505549202586 5.00342335673692e-10 0.0000000 Toll Like Receptor 3 (TLR3) Cascade
178 R-HSA-937061 6 0.13505549202586 5.00342335673692e-10 0.0000000 TRIF-mediated TLR3/TLR4 signaling
121 R-HSA-5654706 6 0.17523692301412 5.82506066227636e-10 0.0000000 FRS-mediated FGFR3 signaling
15 R-HSA-166054 6 0.13505549202586 7.40306092069974e-10 0.0000000 Activated TLR4 signalling
22 R-HSA-168138 6 0.13505549202586 7.40306092069974e-10 0.0000000 Toll Like Receptor 9 (TLR9) Cascade
32 R-HSA-168181 6 0.13505549202586 7.40306092069974e-10 0.0000000 Toll Like Receptor 7/8 (TLR7/8) Cascade
182 R-HSA-975155 6 0.13505549202586 7.40306092069974e-10 0.0000000 MyD88 dependent cascade initiated on endosome
98 R-HSA-3928665 6 0.192854526292227 8.22835218472964e-10 0.0000000 EPH-ephrin mediated repulsion of cells
17 R-HSA-166058 6 0.143248979847831 1.05188570113004e-09 0.0000000 MyD88:Mal cascade initiated on plasma membrane
24 R-HSA-168142 6 0.143248979847831 1.05188570113004e-09 0.0000000 Toll Like Receptor 10 (TLR10) Cascade
28 R-HSA-168176 6 0.143248979847831 1.05188570113004e-09 0.0000000 Toll Like Receptor 5 (TLR5) Cascade
30 R-HSA-168179 6 0.143248979847831 1.05188570113004e-09 0.0000000 Toll Like Receptor TLR1:TLR2 Cascade
34 R-HSA-168188 6 0.143248979847831 1.05188570113004e-09 0.0000000 Toll Like Receptor TLR6:TLR2 Cascade
45 R-HSA-181438 6 0.143248979847831 1.05188570113004e-09 0.0000000 Toll Like Receptor 2 (TLR2) Cascade
180 R-HSA-975138 6 0.143248979847831 1.05188570113004e-09 0.0000000 TRAF6 mediated induction of NFkB and MAP kinases upon TLR7/8 or 9 activation
184 R-HSA-975871 6 0.143248979847831 1.05188570113004e-09 0.0000000 MyD88 cascade initiated on plasma membrane
103 R-HSA-450294 6 0.162973566722587 1.22288862099984e-09 0.0000000 MAP kinase activation in TLR cascade
75 R-HSA-2682334 6 0.106234165176346 1.7091420198163e-09 0.0000000 EPH-Ephrin signaling
61 R-HSA-2029482 6 0.168197627999717 6.90855391597471e-09 0.0000000 Regulation of actin dynamics for phagocytic cup formation
13 R-HSA-166016 6 0.117870810743105 9.55211957596783e-09 0.0000000 Toll Like Receptor 4 (TLR4) Cascade
36 R-HSA-168898 6 0.117870810743105 9.55211957596783e-09 0.0000000 Toll-Like Receptors Cascades
78 R-HSA-3214858 6 0.139322757701264 9.45086483596051e-09 0.0000000 RMTs methylate histone arginines
55 R-HSA-1963642 6 0.204442546887885 1.54713345516581e-08 0.0000000 PI3K events in ERBB2 signaling
118 R-HSA-5654699 6 0.180040793554223 2.05058839489934e-08 0.0000000 SHC-mediated cascade:FGFR2
185 R-HSA-983231 6 0.101187314067097 3.78204621566802e-08 0.0000001 Factors involved in megakaryocyte development and platelet production
125 R-HSA-5654716 6 0.112753083450221 4.46002952435291e-08 0.0000001 Downstream signaling of activated FGFR4
137 R-HSA-5654743 6 0.111928700819091 4.68240717581215e-08 0.0000001 Signaling by FGFR4
122 R-HSA-5654708 6 0.126382655875272 6.38398344004582e-08 0.0000001 Downstream signaling of activated FGFR3
140 R-HSA-5655332 6 0.126382655875272 6.38398344004582e-08 0.0000001 Signaling by FGFR3 in disease
174 R-HSA-8853338 6 0.126382655875272 6.38398344004582e-08 0.0000001 Signaling by FGFR3 point mutants in cancer
136 R-HSA-5654741 6 0.112753083450221 6.83794452415392e-08 0.0000001 Signaling by FGFR3
95 R-HSA-3928662 6 0.150416046395665 1.73754967513525e-07 0.0000002 EPHB-mediated forward signaling
96 R-HSA-3928663 6 0.110898529367959 1.87549038498414e-07 0.0000002 EPHA-mediated growth cone collapse
9 R-HSA-1250196 6 0.172423031278246 9.6053556097146e-07 0.0000011 SHC1 events in ERBB2 signaling
53 R-HSA-1963640 6 0.172423031278246 9.6053556097146e-07 0.0000011 GRB2 events in ERBB2 signaling
110 R-HSA-5617472 6 0.13365088429017 9.8655906620644e-07 0.0000011 Activation of anterior HOX genes in hindbrain development during early embryogenesis
111 R-HSA-5619507 6 0.13365088429017 9.8655906620644e-07 0.0000011 Activation of HOX genes during differentiation
101 R-HSA-400685 6 0.128774113154716 1.43897006120777e-06 0.0000016 Sema4D in semaphorin signaling
173 R-HSA-8851805 6 0.117444747747302 2.68336319428933e-05 0.0000295 MET activates RAS signaling
126 R-HSA-5654719 6 0.137323920383 3.68249172347517e-05 0.0000403 SHC-mediated cascade:FGFR4
120 R-HSA-5654704 6 0.147967798750578 4.89863113928989e-05 0.0000533 SHC-mediated cascade:FGFR3
106 R-HSA-453279 7 0.110889810549242 7.50929025234752e-20 0.0000000 Mitotic G1-G1/S phases
151 R-HSA-6785807 7 0.105402843779737 1.93583573871198e-18 0.0000000 Interleukin-4 and 13 signaling
52 R-HSA-195721 7 0.110146357993067 4.26885062439873e-17 0.0000000 Signaling by Wnt
60 R-HSA-201681 7 0.109536252555609 8.12155889253289e-17 0.0000000 TCF dependent signaling in response to WNT
105 R-HSA-452723 7 0.103223636213854 4.5266594431579e-11 0.0000000 Transcriptional regulation of pluripotent stem cells

Changes in exposures between cases with single driver gene mutations versus non-mutated

driver_genes<-c("NF1","RB1","BRCA1","BRCA2","CDK12","CCNE1","PTEN","MYC","PIK3CA")
d_mut_ratio_info<-c()
all_exposures_mut_wt_gene<-c()
for(g in driver_genes)
{
  mutated<-mut_matrix[,g]
  for(s in rownames(all_sig))
  {
    mut<-as.numeric(all_sig[s,rownames(mut_matrix)[mutated]])
    nonmut<-as.numeric(all_sig[s,rownames(mut_matrix)[!mutated]])
    all_exposures_mut_wt_gene<-rbind(all_exposures_mut_wt_gene,rbind(cbind(s,g,"mut",mut),cbind(s,g,"wt",nonmut)))

    if(median(mut)>median(nonmut))
    {
    d_mut_ratio_info<-rbind(d_mut_ratio_info,c(g,s,median(mut)-median(nonmut),wilcox.test(mut,nonmut,alternative="greater")$p.value))
    }
  }
}
colnames(d_mut_ratio_info)<-c("Gene","Signature","median_diff","pvalue")
d_mut_ratio_info<-data.frame(d_mut_ratio_info,stringsAsFactors = F)
d_mut_ratio_info<-d_mut_ratio_info[d_mut_ratio_info$median_diff>=0.07,]
d_mut_ratio_info$adj.pvalue<-p.adjust(d_mut_ratio_info$pvalue)
d_mut_ratio_info<-d_mut_ratio_info[d_mut_ratio_info$adj.pvalue<=0.05,]
knitr::kable(d_mut_ratio_info) %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = "100%", height = "300px")
Gene Signature median_diff pvalue adj.pvalue
1 NF1 s1 0.262179436462879 0.00434334955231847 0.0304034
7 BRCA1 s3 0.200486753115807 0.00899089052048452 0.0449545
10 BRCA2 s3 0.242551077528908 0.00542057060793141 0.0325234
11 CDK12 s2 0.119981131001674 0.000227495729328335 0.0020475
12 CDK12 s4 0.179205973812764 8.68644521119764e-05 0.0008686
15 CCNE1 s4 0.0852403555464706 1.76266741440831e-11 0.0000000
16 CCNE1 s6 0.179899918038168 7.39094550209736e-27 0.0000000
18 PTEN s3 0.338255799041865 0.000248900414623073 0.0020475
20 MYC s4 0.100097994114185 7.17013743046833e-13 0.0000000
22 MYC s7 0.0805943170357554 9.70124826320499e-08 0.0000011

Manual selection of representative pathways showing differential exposures

selected_path<-read.table("data/pathway_results_filtered.csv",header = T,sep="\t",stringsAsFactors = F)
selected_pathways<-mut_ratio_info[mut_ratio_info$Type%in%selected_path$Type,c(1,6,2,3,4,5,7)]

CGImuts<-readRDS("data/CGI_mut_role.rds")
CGIcna<-readRDS("data/CGI_cna_role.rds")

#manual summary results
gene_roles<-c()
for(i in 1:nrow(selected_pathways))
{
mut<-CGImuts[CGImuts$gene%in%unlist(strsplit(selected_pathways[i,"geneID"],"/")),]
cna<-CGIcna[CGIcna$gene%in%unlist(strsplit(selected_pathways[i,"geneID"],"/")),]
gene_roles<-rbind(gene_roles,cbind(selected_pathways[i,c(1:2)],mut))
if(nrow(cna)>1)
{
  colnames(cna)<-colnames(mut)
  gene_roles<-rbind(gene_roles,cbind(selected_pathways[i,c(1:2)],cna))
}
}
gene_roles<-gene_roles[order(gene_roles[,1],gene_roles[,2],gene_roles[,4]),]
all_diff_exp<-selected_pathways

#generate pathway mutation plots
pdat<-c()
for(p in unique(selected_pathways$Type))
{
  desc<-unique(selected_pathways[selected_pathways$Type==p,"Description"])
  curr_genes<-pathway_genes[[p]]
  for(g in curr_genes)
  {
    m<-CGImuts[CGImuts$gene==g,c("sample","gene","consequence")]
    m<-m[!duplicated(m$sample),]
    c<-CGIcna[CGIcna$gene==g,c("sample","gene","cna")]
    c<-c[!duplicated(c$sample),]
    colnames(c)<-c("sample","gene","consequence")
    if(nrow(m)>0)
    {
      pdat<-rbind(pdat,cbind(desc,m))
    }
    if(nrow(c)>0)
    {
      pdat<-rbind(pdat,cbind(desc,c))
    }
  }
}
pdat<-pdat[!pdat$gene=="TP53",]  

ord<-pdat %>% group_by(gene) %>% summarise(count=n()) %>% arrange(count)

pdat$gene<-factor(pdat$gene,levels=ord$gene)

ggplot(pdat,aes(x=gene,fill=consequence))+
  geom_bar(stat="count",position = "stack")+
  facet_grid(desc ~ .,scales="free",space="free")+
  coord_flip()+my_theme+theme(strip.text.y = element_text(angle = 0),axis.text.y=element_text(face="italic"))+scale_fill_manual(values=cbPalette)+xlab("Driver genes")+ylab("Mutated cases")

mut_path_count<-reshape2::melt(table(pdat[,c(1,3)]))
mut_path_count<-mut_path_count[mut_path_count$value>0,]

Visualising differences in exposures

pdat<-all_exposures_mut_wt[all_exposures_mut_wt$Description%in%selected_pathways$Description,]
pdat$highlight<-FALSE

pdat$highlight[pdat$Signature%in%c("s4","s6")&pdat$Description=="PI3K/AKT Signaling in Cancer"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s4","s6")&pdat$Description=="Toll Like Receptor 3 (TLR3) Cascade"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s7")&pdat$Description=="Signaling by Wnt"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s1")&pdat$Description=="Regulation of RAS by GAPs"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s4","s7")&pdat$Description=="Interleukin-4 and 13 signaling"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s6")&pdat$Description=="Cyclin E associated events during G1/S transition "]<-TRUE
pdat$highlight[pdat$Signature%in%c("s6")&pdat$Description=="Cellular Senescence"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s3")&pdat$Description=="HDR through Homologous Recombination (HRR)"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s6")&pdat$Description=="Cyclin D associated events in G1"]<-TRUE

pdat$Description<-plyr::revalue(pdat$Description,c(
                       "PI3K/AKT Signaling in Cancer"="PI3K/AKT\nsignaling",
                       "Cyclin E associated events during G1/S transition "="Cyclin E\nevents\nin G1/S",
                       "Interleukin-4 and 13 signaling"="Interleukin-4\nand 13\nsignaling",
                       "Regulation of RAS by GAPs"="RAS\nsignaling",
                       "Signaling by Wnt"="Wnt\nsignaling",
                       "Cyclin D associated events in G1"="Cyclin D\nevents in G1",
                       "Toll Like Receptor 3 (TLR3) Cascade"="Toll-Like\nReceptors\nCascades",
                       "HDR through Homologous Recombination (HRR)"="Homologous\nrecombination"))




pdat$Signature<-plyr::revalue(pdat$Signature,
                         c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))
p<-ggplot(pdat,aes(x=Status,y=as.numeric(Exposure)))+
   geom_rect(data = pdat,aes(fill = highlight==TRUE),xmin = -Inf,xmax = Inf,
            ymin = -Inf,ymax = Inf,alpha = 0.1)+
  geom_boxplot(outlier.size = 0.5,outlier.alpha = 0.3,varwidth = T,lwd=0.25)+facet_grid(Signature~Description)+
  theme_bw()+theme(axis.text=element_text(size=6),axis.title=element_text(size=6),legend.position = "none",
  strip.text = element_text(size = 5))+ylab("Exposure")+coord_cartesian(ylim=c(0,0.7))+
  scale_fill_manual(values=c("white","grey"))
p

pdat<-data.frame(all_exposures_mut_wt_gene,stringsAsFactors = F)
colnames(pdat)<-c("Signature","Gene","Status","Exposure")
pdat$highlight<-FALSE

pdat$highlight[pdat$Signature%in%c("s1")&pdat$Gene=="NF1"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s3")&pdat$Gene=="BRCA1"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s3")&pdat$Gene=="BRCA2"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s2","s4")&pdat$Gene=="CDK12"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s4","s6")&pdat$Gene=="CCNE1"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s4","s7")&pdat$Gene=="MYC"]<-TRUE
pdat$highlight[pdat$Signature%in%c("s3")&pdat$Gene=="PTEN"]<-TRUE


p<-ggplot(pdat,aes(x=Status,y=as.numeric(Exposure)))+
   geom_rect(data = pdat,aes(fill = highlight==TRUE),xmin = -Inf,xmax = Inf,
            ymin = -Inf,ymax = Inf,alpha = 0.1)+
  geom_boxplot(outlier.size = 0.5,outlier.alpha = 0.3,varwidth = T,lwd=0.25)+facet_grid(Signature~Gene)+
  theme_bw()+theme(axis.text=element_text(size=6),axis.title=element_text(size=6),legend.position = "none",
  strip.text = element_text(size = 5),strip.text.x=element_text(face="italic"))+ylab("Exposure")+#coord_cartesian(ylim=c(0,0.7))+
  scale_fill_manual(values=c("white","grey"))
p

SNV signatures

Extract COSMIC mutational signature exposures from SNV calls

This section describes how the trinuleotide motif matrices were extracted from deepWGS Britoc-1 and PCAWG samples using the SomaticSignatures R package and mutational signature exposures extracted using the deconstructSigs R package.

The VCF files can be obtained by writing to the relevant Data Access Committees and placed in data/britroc/snv/ and restricted_data/pcawg_snv/ before running the code below. The human genome reference files used in each cohort should be placed in data/reference/britroc/GRCh37.fa and restricted_data/reference/pcawg/genome.fa.

The motif matrix files derived from the SNV calls used to generate the COSMIC mutational signature exposures of both cohorts are provided.

suppressMessages(library(dplyr))
suppressMessages(library(SomaticSignatures))
suppressMessages(library(deconstructSigs))
suppressMessages(library(NMF))
# Helper functions to extract information from VCF files

# Filter variants passing all filters in the VCF file
filterPass <- function(vrobj){
  
  if(dim(softFilterMatrix(vrobj))[2] !=0){
    nfilters <- dim(softFilterMatrix(vrobj))[2]
    vr <- vrobj[which(rowSums(softFilterMatrix(vrobj)) ==nfilters)]
  }else{
    vr <- vrobj
  }
  return(vr)
}

# Extract trinulceotide frequencies from VCF file
createMotifMatrix <- function(vcffile){
    
    s <- samples(scanVcfHeader(vcffile))
    s <- s[grep("tumor",s)] # select the tumor sample from the vcf which contains both the tumour and matched normal variants
    vr_in <- readVcfAsVRanges(vcffile, "hg19", param = ScanVcfParam(info=NA, samples=s,geno="GT"))
    vr <- filterPass(vr_in)
    motifs <- mutationContext(vr,ref_fasta)
    mm <- motifMatrix(motifs)
  
    return(mm)
}


# Extract mutational signature exposures

getWeights <- function(motifmatrix,signature){
  
  # convert to format required by deconstructSigs
  mm <- data.frame(t(motifmatrix))
  # use the column names in the deconstructSigs signatures dataframe
  colnames(mm) <- colnames(signature)
  
  wt_df <- data.frame()
  
  for(s in rownames(mm)){
    ws <- whichSignatures(tumor.ref = mm, signatures.ref = signature, sample.id = s)
    #add unknown component
    Unknown <- ws$unknown
    wt_df <- rbind(wt_df,cbind(ws$weights,Unknown))
  }
  return(wt_df)
}
Britroc-1

The code below is to generate the motif matrix for the Britroc cohort from the original snv calls.

#Extract trinucleotide frequencies from the BritROC-1 SNV calls

# Subset to list of samples passing QC
samplestatus <- read.csv("data/britroc_deepWGS_sample_status.csv", as.is=T)
trialno <- unique(samplestatus$britroc_ID[samplestatus$status=="final"])

# Create FaFile object from Reference
ref_fasta <- FaFile("data/reference/britroc/GRCh37.fa")

# Create motif matrix from deepWGS BriTROC-1 data
vcfpath <- "data/britroc/snv/"
pat <- "snv.filters.blacklist.vcf" #  suffix of VCF files to be processed
vcflist <- list.files(vcfpath, full.names=TRUE, pattern=pat)
  
# select samples passing QC 
pat <- gsub("[$]", "", pat)
vcf_in <- paste0(vcfpath,"/", trialno, ".", pat)
  
# Select only files that exist
vcf_in <- vcf_in[vcf_in %in% vcflist]
  
motifmat_list <- lapply(vcf_in, createMotifMatrix)
motifmat <- do.call(cbind.data.frame, motifmat_list)
write.table(motifmat, "data/britroc_deepWGS_snv_motif_matrix.txt", sep = "\t", quote=F)

cosmic_weights <- getWeights(motifmat,signatures.cosmic)
cosmic_out <- cbind.data.frame(row.names(cosmic_weights), cosmic_weights, stringsAsFactors=F)
colnames(cosmic_out)[1] <- "Sample"
cosmic_outfile <- "data/britroc_cosmic_weights.txt"
write.table(cosmic_out, file = cosmic_outfile, row.names = F, quote = F,sep="\t")

The code below is to get mutational signature exposures for the Britroc cohort from the motif matrix file generated above.

# Extract weights of COSMIC mutational signatures from the BriTROC-1 cohort
motifmat <- read.table("data/britroc_deepWGS_snv_motif_matrix.txt", sep="\t", header=T)
cosmic_weights <- getWeights(motifmat,signatures.cosmic)
rownames(cosmic_weights) <- gsub("X","",rownames(cosmic_weights))
rownames(cosmic_weights) <- gsub("[.]","-",rownames(cosmic_weights))
cosmic_out <- cbind.data.frame(row.names(cosmic_weights), cosmic_weights, stringsAsFactors=F)
colnames(cosmic_out)[1] <- "Sample"
cosmic_outfile <- "data/britroc_cosmic_weights.txt"
write.table(cosmic_out, file = cosmic_outfile, row.names = F, quote = F,sep="\t")
PCAWG

The code below is to generate the motif matrix for the PCAWG cohort from the original snv calls.

# Extract trinucleotide frequencies from the PCAWG (OV-AU, OV-US) SNV calls
pcawg_cohort_info <- read.table("data/pcawg_cohort_info.tsv", header=T, sep="\t", as.is = T)
pcawg_ids <- pcawg_cohort_info$tumor_wgs_aliquot_id

# Consensus SNV data must be placed in the following directory and files must have the
# .consensus.20160830.somatic.snv_mnv.vcf.gz extension
snvdir <-"restricted_data/pcawg_snv/"
ext <- ".consensus.20160830.somatic.snv_mnv.vcf.gz"
pcawg_ov_filelist  <- paste0(snvdir,pcawg_ids, ext)
pcawg_ref <- FaFile("restricted_data/reference/pcawg/genome.fa")
pcawg_vranges <- lapply(pcawg_ov_filelist, readVcfAsVRanges, param = ScanVcfParam(info=NA))
pcawg_vranges <- lapply(
  seq_along(pcawg_vranges), function(v, n, i) {
    sampleNames(v[[i]]) = n[i]; v[[i]]
    }, 
  v = pcawg_vranges, 
  n = pcawg_ids
  )
pcawg_mc <- lapply(pcawg_vranges, mutationContext, pcawg_ref) 
pcawg_mm <- data.frame(lapply(pcawg_mc, motifMatrix))
write.table("data/pcawg_snv_motif_matrix.txt", sep="\t", quote=F)

pcawg_cosmic_weights <- getWeights(pcawg_mm, signatures.cosmic)
pcawg_samplenames <- sub("X", "",row.names(pcawg_cosmic_weights))
pcawg_samplenames <- gsub("[.]", "-", pcawg_samplenames)
pcawg_out <- cbind.data.frame(pcawg_samplenames,pcawg_cosmic_weights)
colnames(pcawg_out)[1] <- "Sample"
pcawg_cosmic_outfile <- "data/pcawg_cosmic_weights.txt"
write.table(pcawg_out, file = pcawg_cosmic_outfile, row.names=F, quote = F, sep="\t")

The code below is to get mutational signature exposures for the PCAWG cohort from the motif matrix file (pcawg_mm) generated above.

# Extract weights of COSMIC mutational signatures 
pcawg_mm <- read.table("data/pcawg_snv_motif_matrix.txt", as.is=T, header=T, sep="\t")
pcawg_cosmic_weights <- getWeights(pcawg_mm, signatures.cosmic)
pcawg_samplenames <- sub("X", "",row.names(pcawg_cosmic_weights))
pcawg_samplenames <- gsub("[.]", "-", pcawg_samplenames)
pcawg_out <- cbind.data.frame(pcawg_samplenames,pcawg_cosmic_weights, stringsAsFactors=F)
colnames(pcawg_out)[1] <- "Sample"
pcawg_cosmic_outfile <- "data/pcawg_cosmic_weights.txt"
write.table(pcawg_out, file = pcawg_cosmic_outfile, row.names=F, quote = F,sep="\t")

Correlation between SNV and CN signature exposures

pcawg_cosmic_weights = read.table(file = "data/pcawg_cosmic_weights.txt",as.is = T, header=T,row.names = 1)
britroc_cosmic_weights = read.table(file = "data/britroc_cosmic_weights.txt",as.is = T, header=T,row.names = 1)

rownames(britroc_cosmic_weights)<-unlist(lapply(strsplit(rownames(britroc_cosmic_weights),"_"),"[[",3))

cosmic_weights<-rbind(pcawg_cosmic_weights,britroc_cosmic_weights)

cosmic_weights<-cosmic_weights[,colSums(cosmic_weights)>0]

SNV_sig<-cosmic_weights[rownames(cosmic_weights)%in%colnames(cbind(sig_pat_mat_britroc,sig_pat_mat_remain,sig_pat_mat_pcawg)),]
SNV_sig<-SNV_sig[,-ncol(SNV_sig)]#Remove unknown signature component
CN_sig<-cbind(sig_pat_mat_britroc,sig_pat_mat_remain,sig_pat_mat_pcawg)
CN_sig<-CN_sig[,colnames(CN_sig)%in%rownames(SNV_sig)]
CN_sig<-t(CN_sig)
CN_sig<-CN_sig[match(rownames(SNV_sig),rownames(CN_sig)),]

library(Hmisc)
SNV_CN_sig_cor<-rcorr(as.matrix(SNV_sig),as.matrix(CN_sig),type="pearson")
library(corrplot)
cormat<-SNV_CN_sig_cor$r[c(-22:-28),c(22:28)]
pval<-SNV_CN_sig_cor$P[c(-22:-28),c(22:28)]
qval<-p.adjust(SNV_CN_sig_cor$P[c(-22:-28),c(22:28)],method="BH")
dim(qval)<-dim(SNV_CN_sig_cor$P[c(-22:-28),c(22:28)])
rownames(qval)<-rownames(SNV_CN_sig_cor$P[c(-22:-28),c(22:28)])
colnames(qval)<-colnames(SNV_CN_sig_cor$P[c(-22:-28),c(22:28)])
#cormat[qval>=0.1]<-0
colnames(cormat)<-1:nsig
colnames(pval)<-1:nsig
colnames(qval)<-1:nsig
integrated_matrix<-cbind(reshape2::melt(cormat),reshape2::melt(pval)[,3],reshape2::melt(qval)[,3])
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-integrated_matrix[integrated_matrix$qvalue<0.05&abs(integrated_matrix$Correlation)>0.25&!is.na(integrated_matrix$qvalue<0.05),]

colnames(CN_sig)<-1:nsig
for(i in paste0("Signature.",c(1,3,5,13,16)))
{
  for(s in 1:nsig)
  {
  all_cor_dat<-rbind(all_cor_dat,cbind(i,s,CN_sig[,s],SNV_sig[,i]))
  all_cor_dat_cors<-rbind(all_cor_dat_cors,c(Signature=s,cor=cormat[i,s],Feature=i))
  all_cor_dat_pvals<-rbind(all_cor_dat_pvals,c(Signature=s,pval=qval[i,s],Feature=i))
  }
}
rownames(all_cor_dat)<-NULL
all_cor_dat_cors<-data.frame(all_cor_dat_cors,stringsAsFactors = F)
all_cor_dat_pvals<-data.frame(all_cor_dat_pvals,stringsAsFactors = F)
knitr::kable(all_corrs) %>%
    kableExtra::kable_styling(full_width = T)
Variable Signature Correlation pvalue qvalue
3 Signature.3 1 -0.3126844 0.0000832 0.0024449
26 Signature.5 2 0.2562713 0.0013864 0.0226439
43 Signature.1 3 -0.3829189 0.0000010 0.0000760
45 Signature.3 3 0.3220786 0.0000491 0.0018038
97 Signature.16 5 -0.2564781 0.0013736 0.0226439
106 Signature.1 6 0.4116913 0.0000001 0.0000183
108 Signature.3 6 -0.3494260 0.0000096 0.0004680
116 Signature.13 6 0.2694577 0.0007566 0.0158890
129 Signature.3 7 0.2903792 0.0002715 0.0066508

Tandem-duplicator phenotype score

Tandem Duplicator Phenotype Score Calculation

This section describes how the Tandem Duplicator Phenotype (TDP) score was calculated for PCAWG ovarian cancer samples (OV-AU, OV-US) using the method described in Menghi F, et al. (2016) The tandem duplicator phenotype as a distinct genomic configuration in cancer. Proc Natl Acad Sci USA 113(17):E2373–E2382.

suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
# Helper functions to count the number of tandem duplications per chromosome, normalized by chromosome length. 
chr_size <- read.table("../data/hg19.chrom.sizes.txt", sep = "\t", as.is=T)
colnames(chr_size) <- c("chr", "size")
chr_size$scale <- chr_size$size/sum(as.numeric(chr_size$size))


getTdCountPerChr <- function(filename_vec){ 
  
  # Returns a dataframe of no. of duplication events per chromosome 
  # normalised by chromosome size, given a vector of filenames containing 
  # structural variant information
  
  lapply (filename_vec, function(filename){
    # dataframe to contain no. of tandem duplications per chromosome
    td_chr_df <- cbind.data.frame(c(1:22,"X"), rep(0,23), stringsAsFactors=F) 
    colnames(td_chr_df) <- c("chr", "TD")
    # read in structural variant file
    sv <- read.table(filename, header=T, sep ="\t", as.is = T)
    dup_counts <- sv %>%
      filter(svclass == "DUP") %>%
      group_by(chrom1) %>%
      summarise(n_TD=n())
  
    # update counts in the dataframe
    td_chr_df$TD[match(dup_counts$chrom1,td_chr_df$chr)] <- dup_counts$n_TD  
  
    # parse sample name from filename and add to dataframe
    string <- strsplit(filename,"/")[[1]][6]
    sample_name <- strsplit(string1,"[.]")[[1]][1]
    td_chr_df$Sample <- sample_name
    # the normaliation/weight to apply to td count based on chromosome length
    td_chr_df$wt <- chr_size$scale[match(td_chr_df$chr,chr_size$chr)] 
    td_chr_df
  }
)

}


classifyTDP = function(df){
  # Calculates TDP score and classifies samples into TDP or non-TDP classes. 
  # 0.71 (from Menghi et al.) was used as the threshold for classification
  df %>%
    group_by(Sample) %>%
    mutate(exp = wt*sum(TD), oe=abs(TD-exp), oesum= sum(oe)) %>%
    mutate(TDP.score.raw = round(-oesum/sum(TD),3)) %>%
    select(Sample, TDP.score.raw) %>% 
    unique() %>%
    mutate(TDP.score = TDP.score.raw+0.71) %>%
    mutate(TDP.status = ifelse(TDP.score>0,"TDP", "NON-TDP"))
}

The structural variant files from PCAWG should be placed in restricted_data/pcawg_sv/ before running the code below.

# Calulate TDP score from structural variant calls from PCAWG (OV-AU, OV-US) samples.
pcawg_cohort_info <- read.table("data/pcawg_cohort_info.tsv", header=T, sep="\t", as.is = T)
pcawg_ids <- pcawg_cohort_info$tumor_wgs_aliquot_id
path <- "restricted_data/pcawg_sv/"
ext <- ".pcawg_consensus_1.6.161116.somatic.sv.bedpe.gz"
pcawg_ov_filelist  <- paste0(path, pcawg_ids, ext)

# calculate TDP Score and classify samples 
results_ov <- getTdCountPerChr(pcawg_ov_filelist)
all_data_ov <- do.call(rbind,results_ov)
tdp_ov <- classifyTDP(all_data_ov)

write.table(tdp_ov, file = "data/pcawg_TDP_score.txt", sep="\t", quote=F, row.names = F)

Correlation between TDP score and CN signature exposures

TDPscores<-read.table("data/pcawg_TDP_score.txt",header=T,sep="\t",stringsAsFactors = F)
TDPscores<-TDPscores[TDPscores$Sample%in%colnames(sig_pat_mat_pcawg),]

pdat<-reshape2::melt(sig_pat_mat_pcawg)
colnames(pdat)<-c("Signature","ID","Exposure")
pdat<-merge(pdat,TDPscores,by.x=2,by.y=1)

cors <- plyr::ddply(pdat, c("Signature"), summarise, cor = cor.test(Exposure, TDP.score,method="s",exact=F)$estimate)
pvals <- plyr::ddply(pdat, c("Signature"), summarise, pval = cor.test(Exposure, TDP.score,method="s",exact=F)$p.value)
cors$Signature<-as.character(1:nsig)
pvals$Signature<-as.character(1:nsig)
cors$Feature<-"TDP score"
pvals$Feature<-"TDP score"

integrated_matrix<-cbind("Tandem duplicator phenotype score",cors[,1:2],pvals[,2],p.adjust(pvals[,2],method="BH"))
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-rbind(all_corrs,integrated_matrix[integrated_matrix$qvalue<0.05&abs(integrated_matrix$Correlation)>0.25&!is.na(integrated_matrix$qvalue<0.05),])

pdat$Signature<-plyr::revalue(pdat$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))

p<-ggplot(pdat,aes(x=Exposure,y=as.numeric(TDP.score)))+geom_point(size=0.1)+geom_smooth(method = lm)+facet_wrap(~Signature,ncol = 8,scales="free_x")+
  coord_cartesian(ylim = c(-1.2,1.2))+
  geom_text(size=2,data=cors, aes(label=paste0("r== ", round(cor,2))),x=0.1, y=0.9, hjust=-0.05, vjust=0,parse = T)+
    geom_text(size=2,data=pvals, aes(label=paste0("p== ", round(pval,4))), x=0.1, y=0.55, hjust=-0.05, vjust=0,parse=T)+ ylab("")+theme_bw()+theme(axis.text=element_text(size=5),axis.title=element_text(size=7),
  strip.text.x = element_text(size = 7))
p

pvals$pval<-p.adjust(pvals$pval,method="BH")

all_cor_dat<-rbind(all_cor_dat,cbind("TDP score",pdat$Signature,pdat$Exposure,pdat$TDP.score))
all_cor_dat_cors<-rbind(all_cor_dat_cors,cors)
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,pvals)

Chromothripsis

Chromothrispis estimation using Shatterproof

This section describes how the copy-number (.spc) and translocation (.spt) input files were generated from the pcawg copy-number and structural variant datasets to be used with the Shatterproof software.

suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
pcawg_cohort_info <- read.table("data/pcawg_cohort_info.tsv", header=T, sep="\t", as.is = T)
pcawg_ids <- pcawg_cohort_info$tumor_wgs_aliquot_id
path <- "restricted_data/pcawg_sv/"
ext <- ".pcawg_consensus_1.6.161116.somatic.sv.bedpe.gz"
pcawg_ov_filelist  <- paste0(path, pcawg_ids, ext)
pcawg_rds_idx <- which( names(pcawg_segTabs) %in% pcawg_ids)

# Column names for translocation dataframe
spt_colnames <- c("#chr1", "start", "end", "chr2", "start", "end", "quality")
# Column names for copy-number dataframe
spc_colnames <- c("#chr", "start", "end", "number", "quality")

for(sample_name in names(pcawg_segTabs)[pcawg_rds_idx]){
  output_path <- paste0("shatterproof/sample_data/",sample_name,"/")
  dir.create(output_path)
  # read in structural variant file
  sv_filename <- paste0(pcawg_fpath, sample_name, ext) 
  sv <- read.table(sv_filename, header=T, sep ="\t", as.is = T)
  
  # translocations
  sample_spt <- filter(sv, svclass=="TRA") %>%
    select(chrom1,start1,end1, chrom2,start2,end2) %>%
    mutate(quality=".", chrom1 = paste0("chr",chrom1), chrom2=paste0("chr",chrom2))
  colnames(sample_spt) <- spt_colnames
  write.table(sample_spt, file=paste0(output_path,sample_name,".spt"), row.names = F, quote = F, sep="\t")
  
  # rounded absolute CN values 
  CN <- pcawg_segTabs[[sample_name]]
  sample_spc <- mutate(CN, quality=".", segVal=round(as.numeric(segVal),0))
  colnames(sample_spc) <- spc_colnames
  write.table(sample_spc, file=paste0(output_path,sample_name,".spc"), row.names = F, quote = F, sep="\t")

} 

Shatterproof was then run using the following perl command in the shell script runShatterproof.sh.

perl -w shatterproof.pl --cnv shatterproof/sample_data/ --trans shatterproof/sample_data/  --tp53 --config ./config.pl --output shatterproof/sample_data/output/ 

The final score per detected chromothripsis-like event per sample was extracted from the suspect_regions.yml files using the shell script get_final_score.sh.

# Read in table of combined Shatterproof scores across all samples
sp_scores <- read.table("data/pcawg_chromothripsis_scores.txt", as.is = T, header=T, sep="\t")

# Generate a list of events above 80th, 85th, 90th and 95th percentiles of shatterproof scores
perc_rank_scores_df <- sp_scores%>% 
  mutate(perc_rank = round(percent_rank(score),3)) %>%
  filter(perc_rank %in% c(0.8, 0.85, 0.9, 0.95)) %>%
  group_by(perc_rank) %>%
  summarise(max_score = max(round(score,3)))

percentile <- perc_rank_scores_df$perc_rank*100
threshold_vec <- perc_rank_scores_df$max_score

for(i in seq_along(percentile)){
  
  threshold <- threshold_vec[i]
  perc <- percentile[i]
  
  out <- sp_scores %>% 
    filter(score > threshold) %>%
    group_by(sample) %>%
    summarise(n=n())
  out_file <- paste0("data/pcawg_chromothripsis_counts_high_scores_", perc, ".txt")
  
  write.table(out, out_file, row.names = F, quote = F,sep="\t")
}

A conservative threshold was set at the 95th percentile of our distribution of scores to minimise false positives. Calls with scores greater than 0.485 were used to obtain a count of chromothriptic events per sample.

high_score_events_df <- sp_scores %>% 
  filter(score > perc_rank_scores_df$max_score[perc_rank_scores_df$perc_rank==0.95]) %>%
  group_by(sample) %>%
  summarise(n_high_scores=n()) %>%
  group_by(n_high_scores) %>%
  summarise(n_samples=n())

Of 61 samples with scores above the threshold, 49 (80.3%) had 1-2 events, 11 samples (18%) had 3-6 events and 1 sample (1.6%) had 10 events.

# Summary statistics of all calls per sample
calls_per_sample_df <- sp_scores %>%
  group_by(sample) %>%
  summarise(n=n())

summary_calls <- summary(calls_per_sample_df$n)
summary_calls
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.00   22.00   22.67   28.00   47.00

Correlation between Shatterproof scores and CN signature exposures

CHRPscores<-read.table("data/pcawg_chromothripsis_counts_high_scores_95.txt",header=T,sep="\t",stringsAsFactors = F)
CHRPscores<-CHRPscores[CHRPscores$sample%in%colnames(sig_pat_mat_pcawg),]

pdat<-reshape2::melt(sig_pat_mat_pcawg)
colnames(pdat)<-c("Signature","ID","Exposure")
pdat<-merge(pdat,CHRPscores,by.x=2,by.y=1)

pdat<-pdat[pdat$n>0,]

cors <- plyr::ddply(pdat, c("Signature"), summarise, cor = cor.test(Exposure, n,method="p",exact=F)$estimate)
pvals <- plyr::ddply(pdat, c("Signature"), summarise, pval = cor.test(Exposure, n,method="p",exact=F)$p.value)
cors$Signature<-as.character(1:nsig)
pvals$Signature<-as.character(1:nsig)
cors$Feature<-"Shatterproof score"
pvals$Feature<-"Shatterproof score"

integrated_matrix<-cbind("Shatterproof score",cors[,1:2],pvals[,2],p.adjust(pvals[,2],method="BH"))
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-rbind(all_corrs,integrated_matrix[integrated_matrix$qvalue<0.05&abs(integrated_matrix$Correlation)>0.25&!is.na(integrated_matrix$qvalue<0.05),])

pdat$Signature<-plyr::revalue(pdat$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))

p<-ggplot(pdat,aes(x=Exposure,y=as.numeric(n)))+geom_point(size=0.1)+geom_smooth(method = lm)+facet_wrap(~Signature,ncol = 8,scales="free_x")+
  coord_cartesian(ylim = c(0,10))+
  geom_text(size=2,data=cors, aes(label=paste0("r== ", round(cor,2))),x=0.1, y=7.5, hjust=-0.05, vjust=0,parse = T)+
    geom_text(size=2,data=pvals, aes(label=paste0("p== ", round(pval,4))), x=0.1, y=8.5, hjust=-0.05, vjust=0,parse=T)+ ylab("")+theme_bw()+theme(axis.text=element_text(size=5),axis.title=element_text(size=7),
  strip.text.x = element_text(size = 7))
p

all_cor_dat<-rbind(all_cor_dat,cbind("Shatterproof score",pdat$Signature,pdat$Exposure,pdat$n))
all_cor_dat_cors<-rbind(all_cor_dat_cors,cors)
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,pvals)

Telomere length

Telomere length estimation

Telomere lengths of deep WGS samples were estimated using the Telomerecat software http://telomerecat.readthedocs.org/.

Telomerecat was installed using the python installer:

pip install telomerecat

Once installed, the following command was used to output a length estimate for a given BAM file:

telomerecat bam2length /path/to/bamfile.bam

Correlation between telomere length and CN signature exposures

We adjusted the observed tumour sample telomere length to account for ploidy: adj_length=length*(1/purity). We then calculated the spearman rank correlation for all samples with exposures in a signature greater than 10%.

library(ppcor)
telo<-read.table("data/telomere_length_48_tumor.csv",sep=",",header=T,stringsAsFactors = F)
ID<-gsub("^.*?JBLAB","JBLAB",telo$Sample)
ID<-gsub("-ready.bam","",ID)
purity<-pData(all_CN[,colnames(all_CN)%in%ID])[,c("name","purity")]
purity_ploidy<-cbind(purity,getPloidy(all_CN[,colnames(all_CN)%in%ID]))
patient<-unlist(lapply(strsplit(telo$Sample,"_"),"[[",1))
telo<-data.frame(ID,patient,length=telo$Length,stringsAsFactors = F)
colnames(purity_ploidy)<-c("name","purity","ploidy")
telo<-merge(telo,purity_ploidy,by.x=1,by.y=1)

telo<-cbind(telo,age=surv_dat[match(telo$patient,surv_dat$TRIALNO),"AGE"])
telo<-telo[!is.na(telo$age),]

pdat<-reshape2::melt(cbind(sig_pat_mat_britroc,sig_pat_mat_remain))
colnames(pdat)<-c("Signature","ID","Exposure")
pdat<-merge(pdat,telo,by.x=2,by.y=1)

cors <- plyr::ddply(pdat, c("Signature"), summarise, cor = pcor.test(Exposure, length,c(purity,age),method="s")$estimate)
pvals <- plyr::ddply(pdat, c("Signature"), summarise, pval = pcor.test(Exposure, length,c(purity,age),method="s")$p.value)
cors$Signature<-as.character(1:nsig)
pvals$Signature<-as.character(1:nsig)
cors$Feature<-"Telomere length"
pvals$Feature<-"Telomere length"

integrated_matrix<-cbind("Telomere length",cors[,1:2],pvals[,2],p.adjust(pvals[,2],method="BH"))
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-rbind(all_corrs,integrated_matrix[integrated_matrix$qvalue<0.1&!is.na(integrated_matrix$qvalue<0.1),])

pdat$Signature<-plyr::revalue(pdat$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))

p<-ggplot(pdat,aes(x=Exposure,y=as.numeric(length)/1000))+geom_point(size=0.1)+geom_smooth(method = lm)+facet_wrap(~Signature,ncol = nsig,scales="free_x")+
  coord_cartesian(ylim = c(0,12))+
  geom_text(size=2,data=cors, aes(label=paste0("r== ", round(cor,2))),x=0.1, y=10, hjust=-0.05, vjust=0,parse = T)+
    geom_text(size=2,data=pvals, aes(label=paste0("p== ", round(pval,4))), x=0.1, y=8, hjust=-0.05, vjust=0,parse=T)+ ylab("")+theme_bw()+theme(axis.text=element_text(size=5),axis.title=element_text(size=7),
  strip.text.x = element_blank(),strip.background = element_blank())
p

pvals$pval<-p.adjust(pvals$pval,method="BH")

all_cor_dat<-rbind(all_cor_dat,cbind("Telomere length",pdat$Signature,pdat$Exposure,pdat$length))
all_cor_dat_cors<-rbind(all_cor_dat_cors,cors)
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,pvals)

Fold-back inversion events

Amplification associated fold-back inversion estimation

This file describes how to calculate the proportion of amplification associated fold-back inversions (ampFBI) from head-to-head inversion (h2hINV) type structural variants and copy-number information. For each sample, the no. of h2hINVs found in a 200kbp region centred around an amplified copy-number segment (CN >=5) was calculated. The proportion of ampFBI was defined as the no. of such h2hINVs relative to the total no. of structural variants in the sample. The X chromosome was left out of analyses.

suppressMessages(library(Biobase))
suppressMessages(library(GenomicRanges))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
pcawg_cohort_info <- read.table("data/pcawg_cohort_info.tsv", header=T, sep="\t", as.is = T)
pcawg_ids <- pcawg_cohort_info$tumor_wgs_aliquot_id
path <- "restricted_data/pcawg_sv/"
ext <- ".pcawg_consensus_1.6.161116.somatic.sv.bedpe.gz"
pcawg_ov_filelist  <- paste0(path, pcawg_ids, ext)

# Select ids for which both structural variant and copy-number information exist
ids <- pcawg_ids[pcawg_ids %in% names(pcawg_segTabs)]
segTable_colnames <- colnames(pcawg_segTabs[[1]])

ampFBI <- list()

for(sample_name in ids){
  # the no. of amplification-associated fold-back inversions in each sample
  n_amplifiedFBI <- 0 
  file_name <- paste0(filepath,sample_name,ext)
  sv <- read.table(file_name,header=T, sep ="\t", as.is = T)
  sv <- filter(sv, chrom1 != "X")
  totalSVs <- dim(sv)[1]
  h2h_count <- filter(sv,  svclass == "h2hINV") %>%
    summarise(n=n()) %>%
    .$n

  
  if(h2h_count>0){
    
    h2h_df <- filter(sv,  svclass == "h2hINV") %>%
      select(chrom1,start1,end2) %>%
      mutate(start1 = start1-1e5, end2 = end2+1e5) %>% # add 100KB to either side
      rename(chr=chrom1, start=start1, end=end2)
  
    h2h_granges <- makeGRangesFromDataFrame(h2h_df)
    seg_granges <- makeGRangesFromDataFrame(pcawg_segTabs[sample_name],  keep.extra.columns = T)
    # identify segments overlapping h2hINVs 
    hits <- findOverlaps(h2h_granges,seg_granges)
    # filter hits where copy-number >= 5 
    CN <- as.numeric(elementMetadata(seg_granges )[,1])
    highCNhits <- subjectHits(hits)[subjectHits(hits) %in% which(CN >=5)]
  
    if(length(highCNhits)>0){
      subset_hits <- hits[subjectHits(hits) %in% highCNhits]
      qHits <- queryHits(subset_hits)
      # count once if h2hINV overlaps multiple segments
      n_amplifiedFBI <- length(unique(qHits))/totalSVs
      
    }
   
  }
  
  ampFBI[[sample_name]] <- n_amplifiedFBI

}

ampFBI.df <- data.frame(names(ampFBI),unlist(ampFBI), stringsAsFactors = F)
colnames(ampFBI.df) <- c("ID", "amp_FBI")
write.table(ampFBI.df, "data/pcawg_amplified_FBI_fraction.txt", sep="\t", quote=F, row.names = F)

Correlation between number of FBI events and CN signature exposures

fbi<-read.table("data/pcawg_amplified_FBI_fraction.txt",sep="\t",header=T,stringsAsFactors = F)

pdat<-reshape2::melt(sig_pat_mat_pcawg)
colnames(pdat)<-c("Signature","ID","Exposure")
pdat<-merge(pdat,fbi,by.x=2,by.y=1,all.x=T)
pdat<-pdat[!is.na(pdat$amp_FBI),]
pdat<-pdat[!pdat$amp_FBI==0,]

cors <- plyr::ddply(pdat, c("Signature"), summarise, cor = cor.test(Exposure, amp_FBI,method="p",exact=T)$estimate)
pvals <- plyr::ddply(pdat, c("Signature"), summarise, pval = cor.test(Exposure, amp_FBI,method="p",exact=T)$p.value)
cors$Signature<-as.character(1:nsig)
pvals$Signature<-as.character(1:nsig)
cors$Feature<-"Amp FBI"
pvals$Feature<-"Amp FBI"

pdat$Signature<-plyr::revalue(pdat$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))

p<-ggplot(pdat,aes(x=Exposure,y=as.numeric(amp_FBI)))+geom_point(size=0.1)+geom_smooth(method = lm)+facet_wrap(~Signature,ncol = nsig,scales="free_x")+
  coord_cartesian(ylim = c(0,0.1))+
  geom_text(size=2,data=cors, aes(label=paste0("r== ", round(cor,2))),x=0.1, y=0.075, hjust=-0.05, vjust=0,parse = T)+
    geom_text(size=2,data=pvals, aes(label=paste0("p== ", round(pval,4))), x=0.1, y=0.085, hjust=-0.05, vjust=0,parse=T)+ ylab("")+theme_bw()+theme(axis.text=element_text(size=5),axis.title=element_text(size=7),
  strip.text.x = element_blank(),strip.background = element_blank())
p

integrated_matrix<-cbind("Amplification associated fold-back inversions",cors[,1:2],pvals[,2],p.adjust(pvals[,2],method="BH"))
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-rbind(all_corrs,integrated_matrix[integrated_matrix$qvalue<0.05&abs(integrated_matrix$Correlation)>0.25&!is.na(integrated_matrix$qvalue<0.05),])

pvals$pval<-p.adjust(pvals$pval,method="BH")

all_cor_dat<-rbind(all_cor_dat,cbind("Amp FBI",pdat$Signature,pdat$Exposure,pdat$amp_FBI))
all_cor_dat_cors<-rbind(all_cor_dat_cors,cors)
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,pvals)

Age at diagnosis

age<-c(pcawg_surv_dat$donor_age_at_diagnosis,tcga_clin$age_at_initial_pathologic_diagnosis)
age<-data.frame(ID=c(rownames(pcawg_surv_dat),rownames(tcga_clin)),age,stringsAsFactors = F)
pdat<-reshape2::melt(cbind(sig_pat_mat_pcawg,sig_pat_mat_tcga))
colnames(pdat)<-c("Signature","ID","Exposure")
pdat<-merge(pdat,age,by.x=2,by.y=1)

cors <- plyr::ddply(pdat, c("Signature"), summarise, cor = cor.test(Exposure, age,method="p",exact=T)$estimate)
pvals <- plyr::ddply(pdat, c("Signature"), summarise, pval = cor.test(Exposure, age,method="p",exact=T)$p.value)
cors$Signature<-as.character(1:nsig)
pvals$Signature<-as.character(1:nsig)
cors$Feature<-"Age diagnosis"
pvals$Feature<-"Age diagnosis"
pdat$Signature<-plyr::revalue(pdat$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))


ggplot(pdat,aes(x=Exposure,y=as.numeric(age)))+geom_point(size=0.1)+geom_smooth(method = lm)+facet_wrap(~Signature,ncol = nsig,scales="free_x")+
  coord_cartesian(ylim = c(0,100))+
  geom_text(size=2,data=cors, aes(label=paste0("r== ", round(cor,2))),x=-Inf, y=15, hjust=-0.05, vjust=0,parse = T)+
    geom_text(size=2,data=pvals, aes(label=paste0("p== ", round(pval,4))), x=-Inf, y=5, hjust=-0.05, vjust=0,parse=T)+ ylab("Age at diagnosis")+theme_bw()+theme(axis.text=element_text(size=5),axis.title=element_text(size=5),
  strip.text.x = element_text(size = 5))

cors$Signature<-as.character(1:nsig)
integrated_matrix<-cbind("Age",cors[,1:2],pvals[,2],p.adjust(pvals[,2],method="BH"))
colnames(integrated_matrix)<-c("Variable","Signature","Correlation","pvalue","qvalue")
all_corrs<-rbind(all_corrs,integrated_matrix[integrated_matrix$qvalue<0.05&abs(integrated_matrix$Correlation)>0.25&!is.na(integrated_matrix$qvalue<0.05),])
pvals$pval<-p.adjust(pvals$pval,method="BH")

all_cor_dat<-rbind(all_cor_dat,cbind("Age diagnosis",pdat$Signature,pdat$Exposure,pdat$age))
all_cor_dat_cors<-rbind(all_cor_dat_cors,cors)
all_cor_dat_pvals<-rbind(all_cor_dat_pvals,pvals)

Plot of all correlations

Plot of all associations

BRCA1/2 germline carriers

#this chunk of code is not run be default due to data restrictions. 
#to run this code please download hrd_germline_w_CN_exposures.txt into the restricted_data directory and ensure previous restricted data code chunks are run
brca_cases<-read.table("restricted_data/hrd_germline_w_CN_exposures.txt",header=T,stringsAsFactors = F)
brca_loc<-gene_coords[gene_coords$gene=="BRCA1"|gene_coords$gene=="BRCA2",]
BRCAmuts<-HRmuts[(!HRmuts$Type=="nonsynonymous")&(!HRmuts$Gene=="WT"),]
BRCAmuts<-BRCAmuts[!is.na(BRCAmuts$Gene),]
BRCAmuts<-merge(brca_cases,BRCAmuts,by.x=1,by.y=1)

res<-c()
for(j in 1:nrow(BRCAmuts))
{
  name<-BRCAmuts[j,"Gene"]
  chr<-brca_loc[brca_loc$gene==name,"chr"]
  tss<-abs(as.numeric(brca_loc[brca_loc$gene==name,"tss"]))
  segTable<-all_segTabs[[BRCAmuts[j,"ID"]]]
  cn<-as.numeric(segTable[segTable[,1]==chr&as.numeric(segTable[,2])<tss&as.numeric(segTable[,3])>tss,4])
    if(!length(cn)==1)
    {
      modtss<-tss+1000000
      cn<-as.numeric(segTable[segTable[,1]==chr&as.numeric(segTable[,2])<modtss&as.numeric(segTable[,3])>modtss,4])
      if(!length(cn)==1)
      {
      cn<-NA
      }
    }
    res<-c(res,cn)
}
BRCAmuts$cn<-res
BRCAmuts$LOH<-res<1.4
BRCAmutloh<-BRCAmuts[BRCAmuts$LOH&BRCAmuts$Gene=="BRCA2","ID"]
pdat_sig<-reshape2::melt(sig_pat_mat_britroc_all)
colnames(pdat_sig)<-c("Signature","Patient","Exposure")

BRCAmutloh<-c("JBLAB-4195","JBLAB-4127","JBLAB-4162","IM_160")

pdat_brca<-pdat_sig[pdat_sig$Patient%in%BRCAmutloh,]

pdat_brca$Patient<-factor(pdat_brca$Patient,levels=c("JBLAB-4195","JBLAB-4127","JBLAB-4162","IM_160"))

pdat_brca$Signature<-plyr::revalue(pdat_brca$Signature,c(s1=1,s2=2,s3=3,s4=4,s5=5,s6=6,s7=7))

p<-ggplot(pdat_brca,aes(x=Patient,y=Exposure,fill=Signature))+geom_bar(stat="identity")+theme_bw()+
      scale_fill_manual( values = cbPalette)+
      theme(axis.text.x = element_blank(),axis.ticks.x=element_blank(),
            axis.text=element_text(size=7),axis.title=element_text(size=7),
            legend.text = element_text(size = 7),legend.title=element_text(size=7),
            legend.position='right',
            plot.margin = margin(0, 0, 0, 0, "cm"))+ylim(c(0,1))+xlab("BRCA2 germline mutation carriers + somatic LOH (n=4)")+
      ylab("Exposures")
p

Treatment response analysis

Here we used cases with archival samples (pre-treatment) and matched relapse samples (post-treatment) to explore whether treatment affected copy-number signatures, or if copy-number signatures could predict response.

#this code chuck is not run by default due to data restrictions. Please obtain the reg.csv file from the Britroc DACO and put it in the restricted_data directory to run
#extract exposures for cases with paried samples
clin<-read.table("restricted_data/reg.csv",sep=",",header=T,stringsAsFactors = F,quote="")
clin<-clin[,c("TRIALNO","PLATINUM")]
clin$PLATINUM<-plyr::revalue(as.character(clin$PLATINUM),c("1"="Sensitive","2"="Resistant"))
priorlines<-read.table("data/prior_lines.csv",sep=",",header=T,stringsAsFactors = F,quote="")
samp_annotation_clin<-merge(samp_annotation,clin,by.x=1,by.y=1,all.x=T)
samp_annotation_clin<-merge(samp_annotation_clin,priorlines,by.x=1,by.y=1,all.x=T)

samples<-samp_annotation_clin[samp_annotation_clin$IM.JBLAB_ID%in%colnames(sig_pat_mat_britroc_all),c(1,2,8,9,10)]


samples<-samples %>% 
  mutate(sample=ifelse(grepl("IM",IM.JBLAB_ID),"P","R")) %>%
  arrange(Britroc_No,desc(star_rating))%>%
  group_by(sample) %>% distinct(Britroc_No,.keep_all = TRUE)

samples[samples$IM.JBLAB_ID=="IM_91"|samples$IM.JBLAB_ID=="IM_70","sample"]<-"R"

samples[samples$sample=="P","PriorLinesChemo"]<-0
samples[is.na(samples$PriorLinesChemo),"PriorLinesChemo"]<-0

paired_samples<-samples %>% group_by(Britroc_No) %>% filter(n()>1) %>% as.data.frame

paired_dat<-reshape2::melt(sig_pat_mat_britroc_all[,colnames(sig_pat_mat_britroc_all)%in%paired_samples$IM.JBLAB_ID])
paired_dat<-merge(paired_dat,paired_samples,by.x=2,by.y=2)
colnames(paired_dat)<-c("ID","Signature","Exposure","Britroc_No","star_rating","status","prior_chemo","sample")
saveRDS(paired_dat,"data/paired_sample_details.rds")

Use a linear model to test if signature exposures change in repsonse to treatment:

library(broom)
paired_dat<-readRDS("data/paired_sample_details.rds")
curr_dat<-filter(paired_dat,Britroc_No!=45&Britroc_No!=32)
curr_dat<-tidyr::spread(curr_dat[,c(2,3,4,6,8)],key="sample",value="Exposure")

prepost_dat<-merge(curr_dat,surv_dat[,c("TRIALNO","AGE","PFS","OS")],by.x=2,by.y=1)
prepost_dat<-merge(prepost_dat,unique(paired_dat[paired_dat$sample=="R",c(4,7)]),by.x=1,by.y=1)

prepost_dat$status<-factor(prepost_dat$status,levels=c("Sensitive","Resistant"))
prepost_dat$change=prepost_dat$R-prepost_dat$P
prepost_dat$Signature=factor(prepost_dat$Signature,levels=paste0("s",c(5,1:4,6:8)))

#centre AGE
prepost_dat$AGE2<-(prepost_dat$AGE-mean(prepost_dat$AGE))/sqrt(var(prepost_dat$AGE))

colnames(prepost_dat)<-c("Britroc_No","Signature","status","diagnosis_exposure","relapse_exposure",
                         "AGE","time_to_relapse","overall_survival","prior_lines_chemo","change_in_exposure","centered_AGE")

#signature changes from diagnosis to relapse taking into account prior exposure, time, and age
diff_sig_time<-prepost_dat %>%
  group_by(Signature) %>%
  do(model=lm(log(relapse_exposure+1)~log(diagnosis_exposure+1)+time_to_relapse+centered_AGE+prior_lines_chemo,data=.)) %>% 
  tidy(model) %>% mutate(pval.adj = p.adjust (p.value, method='BH')) 

knitr::kable(diff_sig_time%>%
  filter(pval.adj<0.05)) %>%
  kableExtra::kable_styling(full_width = T)
Signature term estimate std.error statistic p.value pval.adj
s1 log(diagnosis_exposure + 1) 0.7257877 0.1539868 4.713311 0.0000607 0.0003036
s2 log(diagnosis_exposure + 1) 0.4980334 0.1685788 2.954305 0.0062877 0.0314385
s3 log(diagnosis_exposure + 1) 0.6304988 0.1528667 4.124500 0.0003007 0.0015034
s4 log(diagnosis_exposure + 1) 0.5691329 0.1360120 4.184431 0.0002558 0.0012790
s6 log(diagnosis_exposure + 1) 0.4897941 0.1575265 3.109281 0.0042791 0.0213956
s7 log(diagnosis_exposure + 1) 0.5559967 0.1659289 3.350813 0.0023182 0.0115908

Use a generalised linear model to test if signature exposures at diagnosis predict response:

#do signature predict response at diagnosis
diff_sig_response<-prepost_dat %>%
  group_by(Signature) %>%
  do(model=glm(I(status=="Sensitive")~diagnosis_exposure+centered_AGE,data=.,family="binomial")) %>%
  tidy(model) %>%mutate(pval.adj = p.adjust (p.value, method='BH')) 

knitr::kable(diff_sig_response%>%
  filter(pval.adj<0.05)) %>%
  kableExtra::kable_styling(full_width = T)
Signature term estimate std.error statistic p.value pval.adj
s1 (Intercept) 3.985971 1.3545777 2.942593 0.0032548 0.0097643
s1 diagnosis_exposure -7.521767 3.1208918 -2.410134 0.0159467 0.0239200
s6 (Intercept) 2.078200 0.7438572 2.793815 0.0052090 0.0156271